Setting out again: Service Mesh still has a long way to go, but its future is especially promising

Posted Jun 16, 2020 (19 min read)

Foreword

Service Mesh Comics

Almost everyone is talking about Service Mesh; hardly anyone seems to know how to put it into production; yet everyone believes that everyone else is adopting it at full speed; so everyone claims to be doing Service Mesh too.

The above is only a joke, but it reflects reality to some extent: Service Mesh is a design idea and concept rather than a specific architecture or implementation, even though the Istio + Envoy combination seems to have become the de facto standard. Looking around, we find that the ideal is rich while the reality is lean: because of each enterprise's practical constraints, Service Mesh has blossomed in many different forms.

Ant Financial's service mesh is one of those blossoms. We have passed the exploration phase and entered production. During last year's Double Eleven, it carried the core transaction and payment links and was verified in production at a scale of hundreds of thousands of containers. Even so, the industry still holds very different views of Service Mesh. Some embrace it enthusiastically; others are confused or skeptical, questioning its value, architecture and performance. What is our attitude toward this? After the in-depth practice of Double Eleven, where is Ant Financial's service mesh heading? Is the Service Mesh architecture the final form?

Drawing on Ant Financial's real internal scenarios and thinking, this article describes how we have planned and continued to evolve our service mesh since Double Eleven 2019.

Ant Financial Service Mesh Practice Review

Ant Financial Service Double Eleven Service Mesh Practice Architecture

The picture above shows the architecture Ant Financial ran during Double Eleven 2019. The cloud-native network proxy MOSN (https://github.com/mosn), Ant Financial's self-developed data plane, carries the east-west traffic of the Mesh architecture. For the control plane, taking a pragmatic approach, we worked out a solution that is practical at the current stage and implemented the Service Mesh architecture on top of the traditional service discovery system.

Service Mesh practice

Here is a summary of the rollout data. While meeting business requirements, we achieved genuinely low intrusion into the business, extremely low resource consumption, and the ability to iterate rapidly. Both the business and the infrastructure technology enjoy the dividends of cloud-native Mesh.

Service Mesh has a long way to go

Software Architecture and Design Trend Report

Let's look at the Software Architecture and Design trend report published by InfoQ in April 2020. Service Mesh is currently in the Early Adoption stage and is still a hot topic in cloud-native circles; Mesh architecture tracks appear at all kinds of technical conferences. This article does not spend much time on the selection, use cases or justification of Service Mesh. Readers who want that can refer to the earlier articles listed at the end of this post, which contain much of Ant Financial's thinking on Service Mesh.

For us, having chosen this path after careful thought and having practiced it in depth during last year's Double Eleven, the question is how to proceed now that we are midway. Besides landing things pragmatically, we also have to look up at the starry sky and understand what gaps still separate us from the ideal (the "poetry and distant places"):

Non-Comprehensive Cloud Native

As mentioned earlier, when we rolled out Service Mesh we kept the traditional microservice system. Although the overall architecture is based on K8s, the control plane does not use the community solution. This was a deliberate trade-off, but as the overall architecture evolves, not being fully cloud native will inevitably become the biggest obstacle to our continuing to enjoy cloud-native dividends.

Insufficient platform capabilities

Service Mesh is meant to decouple infrastructure from business. At present, however, whether it is the community's Istio + Envoy combination or Ant Financial's traditional microservices + MOSN practice, the focus is on governing east-west traffic, and the ideal is still a long way off. A lot of infrastructure logic is still embedded in business systems as SDKs, and we still have to deal with the impact of infrastructure upgrades on the business.

Incomplete coverage of border traffic

While cloud native is gaining momentum inside the data center, Layer 7 application traffic at the data center boundary and on the edge network has not yet been brought into one global system. Lacking such a system, we have to develop the border gateway and the Mesh network separately, each with its own traffic scheduling system and its own security and trust system.

Low ecological integration

The traditional service system has developed over many years and accumulated a great deal of value, while Service Mesh has emerged as a newcomer. There are two sides to this: Service Mesh needs integration with and support from the traditional service system in order to migrate existing business to Mesh; at the same time, the components of the traditional service system need to be able to integrate with Mesh in order to stay competitive.

Performance

Performance is a perennial topic, and doubts about the performance of the Mesh architecture never stop, covering Mixer on the control plane as well as the extra network and codec overhead introduced by the Sidecar. However, the community has been working on these problems, for example by re-architecting Mixer and by introducing eBPF to accelerate traffic hijacking.

In summary, we have a long way to go in Service Mesh.

Carry out Service Mesh to the end

This year our goal is for Mesh to fully cover our main businesses, which brings very big challenges:

  • Financial-grade security and trust requirements call for full-link encryption and service authentication;
  • Unifying the Sidecar and the Ingress Web Server;
  • Landing a cloud-native control plane;
  • Transparent hijacking capability;
  • Sinking more middleware capabilities into the Mesh;

Having analyzed the problems that currently exist and combined them with Ant Financial's own business needs, we can prescribe targeted remedies. We group the problems above into three categories and tackle each one specifically:

  • Address ecological integration through open source ecosystem building;
  • Address non-comprehensive cloud native through the evolution of cloud-native standards;
  • Finally, address insufficient platform capabilities, incomplete scenario coverage and performance by strengthening basic core capabilities;

Open source ecological construction

Let's review the first step we took after Double Eleven: at the ninth Service Mesh Meetup, hosted by Ant Financial on December 28, 2019, we announced that MOSN had completed its incubation within SOFAStack and would begin operating independently, seeking cooperation and partners with a more open attitude:

_We believe that the future will belong more to those who bid farewell to the cathedral and embrace the bazaar. ("The Cathedral and the Bazaar")_

Along with announcing independent operation, we took a series of measures:

  • Independent project domain name: mosn.io
  • Project address: github.com/mosn/mosn
  • Community organization: MOSN Community Organization
  • Project governance rules: PMC, Committer election and promotion mechanisms, etc.

Since then, we have continued to do a lot in the open source community, including creating special working groups such as the Istio WG and the Dubbo WG.

MOSN open source community status

We have also sought a lot of external cooperation. More than half of the contributors come from outside Ant Financial, and our first external Committer came directly from BOSS. For ecological integration, we have cooperated in depth with the SkyWalking, Sentinel and Dubbo-go communities.

Skywalking


Call dependencies and call status between services are important indicators in microservice governance, and SkyWalking is an excellent APM product in this field. MOSN has worked with the SkyWalking community on a deep integration of the two systems, which currently supports:

  • Call link topology display;
  • QPS monitoring;
  • Fine-grained RT display;

In May of this year, SkyWalking received a full upgrade in version 8.0. With the new probe protocol and analysis logic, probes become more aware of one another and are better suited to monitoring under Service Mesh. SkyWalking will also open up the Metrics analysis system that previously existed only in its kernel, so that commonly used Metrics sources such as Prometheus, Spring Cloud Sleuth and Zabbix can be ingested and analyzed in a unified way. In addition, SkyWalking and the MOSN community will continue to cooperate on supporting tracing for Dubbo and SOFARPC and on adapting link tracing to Sidecar mode.

For more detailed information, refer to: http://skywalking.apache.org/zh/blog/2020-04-28-skywalking-and-mosn.html

Sentinel


Sentinel is a lightweight flow control framework for microservices, open sourced by Alibaba. It protects service stability along several dimensions, including flow control, circuit breaking and degradation, and system load protection. MOSN currently has only a simple rate limiting function, so we are working with the Sentinel community to integrate a variety of rate limiting capabilities into MOSN. This further improves MOSN's traffic management while significantly reducing the cost for business teams of adopting and configuring rate limiting.
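To make "a simple rate limiting function" concrete, here is a minimal sketch of a generic token bucket. It is only an illustration of the idea: it is not Sentinel's algorithm or API, nor MOSN's implementation, and the class name, parameters and injectable clock are all assumptions made for this example.

```python
import time


class TokenBucket:
    """Generic token bucket (illustrative only, not Sentinel's or MOSN's code).

    Admits `rate` requests per second on average, with bursts up to `capacity`.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if the request may pass, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at the capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A sidecar would typically keep one such limiter per protected resource (for example per service or per route) and reject or queue a request when `allow()` returns False. Richer strategies such as circuit breaking and load protection need more state than this sketch keeps.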

MOSN cooperates with Sentinel

For the long term, as described later, we will use this as an entry point to propose a new, unified rate limiting standard based on UDPA.

Dubbo


Our support for Dubbo is motivated mainly by the following:

  • Dubbo is a service implementation framework, while Service Mesh is an architectural concept. Dubbo also needs to enjoy the dividends of Service Mesh; enterprises have real adaptation and extension needs, and the Dubbo community has users asking for this;
  • Many users and enterprises cannot become cloud native in one step and need to get there gradually;
  • Current open source solutions cannot support Dubbo service discovery;

MOSN supports Dubbo protocol

MOSN's xprotocol architecture already supported the Dubbo protocol, but we had not supported the Dubbo service system as a whole. This time we designed two solutions to meet users' needs for Dubbo, which together also form a dual-mode microservice architecture. The one on the left of the figure is based on the traditional Dubbo registration center and integrates the Dubbo-go SDK to provide Mesh under the traditional architecture:

  • MOSN provides Subscribe, Unsubscribe, Publish, Unpublish HTTP services;
  • The SDK sends a request to these services provided by MOSN to let MOSN interact with the real registration center;
  • MOSN is directly connected to the registration center through Dubbo-go;
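The three steps above can be sketched roughly as follows. This is a simplified model under stated assumptions: the transport is replaced by plain function calls rather than HTTP, the registry is an in-memory stand-in for a real registration center, and every class name, operation key and payload field is hypothetical rather than taken from MOSN or Dubbo-go.

```python
class InMemoryRegistry:
    """Stand-in for a real registration center (hypothetical, in-memory)."""

    def __init__(self):
        self.providers = {}     # service name -> set of provider addresses
        self.subscribers = {}   # service name -> set of subscriber ids

    def register(self, service, address):
        self.providers.setdefault(service, set()).add(address)

    def deregister(self, service, address):
        self.providers.get(service, set()).discard(address)

    def watch(self, service, client):
        self.subscribers.setdefault(service, set()).add(client)
        return sorted(self.providers.get(service, set()))

    def unwatch(self, service, client):
        self.subscribers.get(service, set()).discard(client)


class SidecarRegistryProxy:
    """Models the sidecar endpoints that a thin SDK calls instead of the registry."""

    def __init__(self, registry):
        self.registry = registry
        self.handlers = {
            "publish": self._publish,
            "unpublish": self._unpublish,
            "subscribe": self._subscribe,
            "unsubscribe": self._unsubscribe,
        }

    def handle(self, operation, request):
        """Dispatch one SDK request; in a real deployment this would be an HTTP call."""
        handler = self.handlers.get(operation)
        if handler is None:
            return {"ok": False, "error": "unknown operation: " + operation}
        return handler(request)

    def _publish(self, req):
        self.registry.register(req["service"], req["address"])
        return {"ok": True}

    def _unpublish(self, req):
        self.registry.deregister(req["service"], req["address"])
        return {"ok": True}

    def _subscribe(self, req):
        addresses = self.registry.watch(req["service"], req["client"])
        return {"ok": True, "addresses": addresses}

    def _unsubscribe(self, req):
        self.registry.unwatch(req["service"], req["client"])
        return {"ok": True}
```

The point of the design is that the application's SDK no longer needs to know which registration center is in use; only the sidecar holds that dependency, so the registry can change without touching business code.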

The solution on the right of the figure extends Istio directly and provides support in a cloud-native way. It was contributed by our community partner Dmall ("Multi-point Life"). For the detailed technical design and usage, read "Multi-point Life on Service Mesh Practice - Istio + Mosn's way of exploration in Dubbo scenarios".

Cloud Native Standard Evolution

As mentioned earlier, both Ant Financial and other companies have put Mesh into production, but in a traditional way, which reflects each company's current situation. As the technology has been explored further, the operability, maintainability and architectural soundness of Istio, the cloud-native service governance system, have gradually improved, and its remaining problems of functional completeness, performance, and deployment and operations complexity will be resolved. At the same time, as cloud native evolves in breadth and depth, a non-cloud-native architecture is bound to hold us back. We therefore work closely with the Istio community to build a global Service Mesh control plane, and pair it with the cloud-native network proxy MOSN to drive our evolution from a traditional Mesh to a cloud-native one. To this end, we have done the following work:

  • Building a standard cloud-native Sidecar;
  • Participating in and contributing to standardization;

For the first point, MOSN continues to align with Istio's capabilities, which involves multi-Sidecar support on the Istio side and feature alignment on the MOSN side. On the control plane, this covers adapting MOSN Sidecar injection and Pilot-agent, and adapting Istio's compilation and build. Feature alignment covers load balancing algorithms, the traffic management system, traffic detection, service governance, Gzip and so on. The overall milestones are:

  • April 2020: complete the breakdown of requirements and tasks; Bookinfo runs on Istio 1.4.x;
  • June 2020: complete the features that the HTTP system strongly depends on; compatible with Istio 1.5.x under its new architecture;
  • August 2020: HTTP system features aligned with Istio;
  • September 2020: support for the Istio pre-release version;

In terms of standardization, we took part in the discussions of the UDPA specifications and, in the community meetings, proposed a general rate limiting API specification for discussion.

UDPA discussion

In addition, MOSN has been actively communicating with the Istio community and seeking cooperation. Our goal is to become a Sidecar recommended by Istio. We raised a corresponding issue on Istio's GitHub, which attracted a lot of attention, and we are very glad that official members answered and discussed it in great detail.

MOSN cooperates with the Istio community

They raised some questions and concerns about this, and had a special discussion at Istio's regular meeting.

Discussion on cooperation with the Istio community

Discussion transcript

For detailed discussion records, please see: https://github.com/istio/istio/issues/23753

Through this exchange we obtained the official thinking and suggestions on the matter, which gave us a very clear goal and motivation. We also have ideas and actions in response to some of the questions Istio raised:

  • On the cost of test coverage, maintenance cost can be reduced by decoupling Istio's test cases from Envoy, or by defining a standard data plane test suite;
  • In addition, members of the MOSN community can join in the maintenance work, further reducing maintenance cost;

We will continue to invest resources to focus on building our own capabilities, while maintaining a collaborative relationship with the community. We believe that when the time is right, the two parties will cooperate in depth.

Enhanced basic core capabilities

Where is Service Mesh heading, and what shape will it take? What capabilities does MOSN need in order to support its continued evolution? In the preceding sections, we used open source ecosystem building and the evolution of cloud-native standards to address the problems of non-comprehensive cloud native and low ecological integration. For the remaining problems, and in light of Ant Financial's own scenarios, we have built up a number of capabilities:

  • Flexible and convenient multi-protocol extension support;
  • Multi-form scalability;
  • Message and P2P communication model;
  • OpenSSL support;
  • Transparent hijacking ability;

Protocol extension

Achilles Heel

Achilles' heel

I use the Achilles' heel to describe how painful protocol extension is, which gives a sense of how much we suffered in this pit. Whether it is Apache httpd of the "old days", Nginx of the "middle ages" or the "modern" Envoy, each is a framework designed for HTTP or other general-purpose protocols. Although many derived products have extended them considerably, extending them with private protocols remains relatively difficult. Beyond forwarding support for the protocol itself, they offer no general governance framework, so each protocol's behavior needs its own independent support: the framework has to understand the whole request life cycle, connection reuse, routing strategy and so on, and the development cost is very high. Based on these practical pain points, we designed the MOSN multi-protocol framework, hoping to lower the cost of onboarding private protocols and speed up the adoption of the Service Mesh architecture. For more detail, see the video talk "Analysis of the Multi-Protocol Mechanism of the Cloud-Native Network Proxy MOSN".

MOSN Multiprotocol Framework

MOSN multi-protocol framework

MOSN Multiprotocol Framework-2
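The idea behind a multi-protocol framework can be sketched as a small codec interface plus a registry: a private protocol only implements matching, decoding and encoding, while shared concerns such as routing or connection reuse live in the framework. The sketch below is an assumption-laden illustration; the interface, both toy wire formats and all names are invented here and are not MOSN's actual xprotocol API.

```python
import struct


class Protocol:
    """Hypothetical codec interface that each protocol plugs into the framework."""
    name = "base"

    def match(self, data):
        raise NotImplementedError

    def decode(self, data):
        """Return (frame, remaining_bytes); frame is None if more bytes are needed."""
        raise NotImplementedError

    def encode(self, request_id, payload):
        raise NotImplementedError


class ToyBinaryProtocol(Protocol):
    """Invented private protocol: 2-byte magic, u32 request id, u32 length, payload."""
    name = "toy-binary"
    MAGIC = b"\xca\xfe"
    HEADER = struct.Struct(">2sII")   # big-endian, no padding: 10 bytes

    def match(self, data):
        return data[:2] == self.MAGIC

    def decode(self, data):
        if len(data) < self.HEADER.size:
            return None, data
        _, request_id, length = self.HEADER.unpack_from(data)
        end = self.HEADER.size + length
        if len(data) < end:
            return None, data         # incomplete frame, wait for more bytes
        frame = {"request_id": request_id, "payload": data[self.HEADER.size:end]}
        return frame, data[end:]

    def encode(self, request_id, payload):
        return self.HEADER.pack(self.MAGIC, request_id, len(payload)) + payload


class ToyLineProtocol(Protocol):
    """Invented text protocol: b'REQ <id> <payload>' terminated by a newline."""
    name = "toy-line"

    def match(self, data):
        return data.startswith(b"REQ ")

    def decode(self, data):
        idx = data.find(b"\n")
        if idx < 0:
            return None, data
        parts = data[:idx].split(b" ", 2)
        payload = parts[2] if len(parts) > 2 else b""
        return {"request_id": int(parts[1]), "payload": payload}, data[idx + 1:]

    def encode(self, request_id, payload):
        return b"REQ %d " % request_id + payload + b"\n"


class ProtocolRegistry:
    """The framework side: holds registered codecs and sniffs the protocol in use."""

    def __init__(self):
        self._protocols = []

    def register(self, protocol):
        self._protocols.append(protocol)

    def detect(self, data):
        for protocol in self._protocols:
            if protocol.match(data):
                return protocol
        return None
```

Because both codecs produce frames with the same fields (a request id and a payload), anything built on top of the registry, such as request/response correlation for connection multiplexing, can be written once rather than once per protocol.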

Scalable modularity

As the business grows and our plans for Service Mesh expand, MOSN has to carry more and more foundational capabilities. Only by providing flexible, efficient and stable extension mechanisms can it stay competitive and remain viable in the long term.

From the start, MOSN borrowed from the excellent designs of Nginx and Envoy and provided an extension mechanism based on filters. A Network Filter lets you build custom proxy logic; a Stream Filter lets you provide functions such as rate limiting, authentication and injection; and a Listener Filter makes transparent hijacking possible.
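As a rough illustration of the filter idea, here is a minimal stream-filter chain in which each filter inspects or mutates a request and can stop further processing. The interface and the two example filters are hypothetical and much simpler than MOSN's real filter API; header names and the status convention are assumptions made for this sketch.

```python
CONTINUE = "continue"
STOP = "stop"


class StreamFilter:
    """Hypothetical filter interface: inspect or mutate headers, then continue or stop."""

    def on_request(self, headers):
        return CONTINUE


class AuthFilter(StreamFilter):
    """Rejects requests whose 'authorization' header does not carry the expected token."""

    def __init__(self, token):
        self.token = token

    def on_request(self, headers):
        if headers.get("authorization") != self.token:
            headers[":status"] = "401"
            return STOP
        return CONTINUE


class InjectFilter(StreamFilter):
    """Adds a marker header, standing in for 'injection' style extensions."""

    def on_request(self, headers):
        headers["x-mesh-injected"] = "true"
        return CONTINUE


class FilterChain:
    """Runs filters in order; the first STOP short-circuits the rest."""

    def __init__(self, filters):
        self.filters = list(filters)

    def run(self, headers):
        """Return True if the request should be forwarded upstream."""
        for stream_filter in self.filters:
            if stream_filter.on_request(headers) == STOP:
                return False
        return True
```

Ordering matters in such a chain: placing authentication first means later filters never see rejected requests, which is the usual reason chains are configured rather than hard-coded.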

This raises a question, though: sometimes the capability we need already exists elsewhere. Can we, with a simple modification, let MOSN gain that capability even when the existing implementation is not written in Go (for example, a ready-made rate limiting or injection implementation)? And some capabilities, such as security-related ones, call for stricter control and higher standards.

For scenarios like these, we introduced the MOSN Plugin mechanism, which lets us develop the capabilities MOSN needs independently, or bring existing programs into MOSN after suitable modification.

Extensible Modular Capability

MOSN's Plugin mechanism has two parts:

  • The first is MOSN's own Plugin framework, which extends MOSN through interaction between an agent inside MOSN and an independent process;
  • The second is based on Golang's Plugin framework and extends MOSN by loading dynamic libraries (SO). Dynamic library loading still has some limitations and remains in beta;
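The first approach, an agent talking to an independent process, can be illustrated with a small sketch. Everything here is an assumption for demonstration: the line-delimited JSON protocol over stdin/stdout, the `PluginClient` name and the toy authorization plugin are invented, and MOSN's real plugin framework uses its own transport and interfaces.

```python
import json
import subprocess
import sys

# Source of a stand-in plugin process. It reads one JSON request per line on
# stdin and answers with one JSON response per line on stdout. A real plugin
# could be written in any language that can speak the same protocol.
PLUGIN_SOURCE = r'''
import json, sys
for line in iter(sys.stdin.readline, ""):
    request = json.loads(line)
    response = {"allowed": request.get("user") in ("alice", "bob")}
    sys.stdout.write(json.dumps(response) + "\n")
    sys.stdout.flush()
'''


class PluginClient:
    """Agent side: starts the plugin process and exchanges JSON lines with it."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )

    def call(self, request):
        """Send one request and block until the plugin replies."""
        self.proc.stdin.write(json.dumps(request) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def close(self):
        """Closing stdin signals end of input, so the plugin loop exits cleanly."""
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()
```

A client would be created with something like `PluginClient([sys.executable, "-c", PLUGIN_SOURCE])`. The trade-off of the out-of-process design is an extra hop per call in exchange for language independence and fault isolation: a crashing plugin does not take the proxy down with it.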

In addition, the currently popular WebAssembly is a direction for future development, and it already has relatively mature support in many scenarios. The official Go project also has a WASM branch. I believe we will be able to enjoy the dividends of WASM in the near future.

Message communication mode

As Service Mesh is adopted and practiced more widely, demand has surfaced for Mesh forms beyond traditional RPC service communication, such as DB and cache Mesh. Fortunately, these communication patterns resemble RPC, so the Sidecar needs few changes to support them. Message communication is different:

  • Stateful network model;
  • Message ordering;
  • Partitions are the atomic unit of load balancing;

Message Communication Mode

As a result, the message SDK can no longer rely on Partitions to order messages, so normal sending and receiving cannot be guaranteed under Mesh. In a Pull/Push Consumer, the Partition is the basic unit of load balancing. A Consumer needs to know how many Consumers in the same ConsumerGroup are consuming the same Partitions, and each Consumer picks the Partitions to consume according to its own position in the group. This makes the messaging load balancing strategy no longer applicable in a Service Mesh system.
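The position-based selection described above can be sketched as a pure function that every consumer runs locally. The function name and the modulo scheme are assumptions chosen for illustration; real message clients offer several assignment strategies.

```python
def assign_partitions(partitions, consumers, me):
    """Return the partitions that consumer `me` should consume.

    Every consumer in the group runs the same function over the same sorted
    inputs, so the results are disjoint and together cover all partitions
    without any coordination.
    """
    members = sorted(consumers)
    if me not in members:
        return []
    index = members.index(me)
    ordered = sorted(partitions)
    # Consumer at position `index` takes every len(members)-th partition.
    return [p for i, p in enumerate(ordered) if i % len(members) == index]
```

With six partitions and three consumers, each consumer ends up with two partitions. The scheme only works if every consumer sees the same, complete membership of its group: a consumer that wrongly believed it was alone would claim all six partitions. This is one way to picture why the strategy stops working once the view of the group is no longer what the consumer expects.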

Message Communication Mode-1

OpenSSL support

In this year's plan, we will implement east-west traffic encryption based on Service Mesh to give transmitted traffic stronger cryptographic protection. We will also introduce the Chinese national cryptographic algorithms to improve security compliance, and achieve end-to-end trust based on secure hardware. The cornerstone of all this is an efficient, strong and stable cryptographic infrastructure. MOSN's native Go-TLS has a number of problems:

  • Weak security capability: there is no software/hardware mechanism for key protection;
  • Long iteration cycle: Go-TLS does not fully support the security features of TLS 1.3 until Go 1.15+;
  • Limited suite support: only typical algorithms such as ECDHE, RSA and ECDSA are supported;
  • Weak performance: for typical algorithms such as RSA, the Go implementation delivers less than 1/5 of the performance of the C implementation;

As the heavyweight of cryptographic infrastructure, OpenSSL became our clear choice. It is widely used, has a rich set of hardware acceleration engines, is maintained by full-time community staff, supports a large and comprehensive set of suites, and has highly optimized algorithms. We also tested and thought carefully about how to integrate OpenSSL. Taking over the entire TLS flow through Cgo in the traditional way would give us a convenient, one-stop integration, but we cannot accept the performance loss that Cgo introduces. The scheme we finally adopted is therefore a hybrid one that uses OpenSSL for specific security capabilities.

OpenSSL Support

Transparent hijacking

Although the community provides a non-intrusive way to join a Service Mesh, the performance loss and operations cost of the native community solution are very high, so in practice we had not adopted non-intrusive access. But as the business expands further, non-intrusive access is becoming urgent, and we need to solve the problems of multi-environment adaptation, operability and performance. We still use iptables as the data plane for traffic hijacking, but optimize for different situations:

  • TProxy replaces DNAT to solve the conntrack connection tracking problem;
  • Hooking the connect system call avoids the performance loss caused by outbound traffic traversing the protocol stack twice;
  • Fuzzy matching of blacklists and whitelists reduces the management cost of the overall rule set;
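The last point, fuzzy matching of blacklists and whitelists, can be pictured with a small sketch. This is purely illustrative: it models the decision in user space with shell-style wildcards from Python's `fnmatch` module, whereas a real deployment expresses it as hijacking rules. The `HijackPolicy` name, the "ip:port" string form and the rule that the deny list wins are all assumptions.

```python
from fnmatch import fnmatchcase


class HijackPolicy:
    """Decides whether traffic to a destination should be redirected to the sidecar."""

    def __init__(self, allow=(), deny=()):
        self.allow = list(allow)   # wildcard patterns that should be hijacked
        self.deny = list(deny)     # wildcard patterns that must bypass the sidecar

    def should_intercept(self, destination):
        """`destination` is an 'ip:port' string; deny patterns take precedence."""
        if any(fnmatchcase(destination, pattern) for pattern in self.deny):
            return False
        return any(fnmatchcase(destination, pattern) for pattern in self.allow)
```

For example, a policy with `allow=["10.0.*:*"]` and `deny=["10.0.0.53:*"]` intercepts traffic to `10.0.1.8:8080` but lets `10.0.0.53:53` pass through untouched. Two patterns replace what would otherwise be a long list of exact entries, which is where the saving in management cost comes from.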

The development of traffic hijacking technology is closely tied to the adoption of Service Mesh. We will continue to evolve it in terms of environment adaptability, low latency and low management cost, and build a unified multi-mode foundation composed of DNAT, TProxy, TC redirect, Sockmap and other technologies. This foundation will adaptively select the most suitable hijacking technique for different kernel environments, performance requirements and management costs, continuously lowering the cost of joining the Service Mesh.

Service Mesh

The above is what we have continued to explore with MOSN and Service Mesh since last year's Double Eleven. The overall milestones are as follows:

MOSN Master Plan

In my opinion, Service Mesh is to cloud-native architecture what high-speed rail is to the national economy. Over a decade of cloud computing, seemingly solid industry and technical barriers have been broken again and again, and classic ideas have often been questioned and challenged. Service Mesh, too, is bound to undergo major changes. My colleague Xiaojian has analyzed this in depth in "Mecha: Carrying Mesh Through", so I will not repeat it here and will mainly share some personal views.

First, in terms of trend, business and infrastructure technology continue to be decoupled while cooperating; middleware capabilities continue to sink, as does the foundational layer of the business; and foundational business services need to integrate better with the Mesh architecture to form a highly consistent ecosystem. Second, as the boundary of the cloud-native network expands, scale effects will inevitably follow, and we will have to solve basic problems such as performance, resource consumption and latency, using approaches such as kernel bypass, Sidecar as Node, and hardware optimization. Finally, we believe that as cloud native evolves, the container network will merge with Service Mesh. Moving from IP-oriented to identity-oriented and service-oriented, the Sidecar can settle into the system infrastructure, becoming a secure container network stack and a basic network unit of intelligent hardware devices.

Once the Sidecar has sunk to become part of the system, it begins to evolve from a framework into a platform, providing distributed primitive abstractions and exposing remote APIs to serve applications; Dapr is one such implementation. In addition, we are experimenting with communication based on shared memory. Ultimately, business development will become Mesh-oriented programming, and the Mesh architecture will eventually form a distributed microservices OS.

But there is no silver bullet. Although distributed systems have become the mainstream form for new business, in many traditional fields centralized architectures still run many core systems. What such systems care about most is operational efficiency and the stability that high availability demands, and these are precisely the strengths of a mature centralized architecture. For front-line business systems, the bigger challenge is responding to rapid market changes, iterating quickly and seizing the market. Distributed architecture, and microservice frameworks in particular, arose to help users iterate quickly and ship business capabilities. Service Mesh will become a booster for this architecture.

About the Author

Xiao Han (alias Hanchang) joined Ant Financial in 2011 and has worked on Layer 4/Layer 7 network load balancing, high-performance proxy servers and network protocols. He currently leads the application network group in Ant Financial's trusted native technology department and heads MOSN, Ant Financial's open source cloud-native network proxy project.

Ant Financial Service Mesh Double Eleven landing series articles