Ant Financial's experience putting Service Mesh monitoring into production

Posted Jun 16, 2020 (10 min read)


Service Mesh is currently one of the hottest technology directions in the community. During last year's Double 11, it was fully rolled out at Ant Financial and smoothly supported the big promotion. Ant Financial runs what is currently the largest Service Mesh cluster, and this article summarizes our experience of putting Service Mesh into production from the monitoring side, covering the following aspects:

  1. Cloud native monitoring: how Metrics monitoring was put into production at Ant Financial;
  2. The user's perspective: how application owners experience this infrastructure service, and what SREs require of monitoring in order to keep site-wide services stable;
  3. Future thinking: the direction of follow-up development.

Cloud native monitoring

The design concept of cloud-native applications is accepted by more and more developers. This year, Ant Financial's application services went fully cloud native, which places higher demands on our monitoring services. The Metrics indicator monitoring service has gradually taken shape as a system. The following figure shows how the community's native Prometheus collection solution is applied in the Ant Financial monitoring scenario.


How to collect

Ant Financial's monitoring collection Agent is deployed on physical machines and supports multiple collection plug-ins, as shown in the following figure: command execution, logs, HTTP requests, dynamic SQL collection, system indicator collection, JVM collection, process monitoring, and so on. It also supports multiple parsing plug-ins: custom parsing, single-line text parsing, Lua script parsing, JSON parsing, Prometheus parsing, and so on.


In the Service Mesh monitoring rollout, business teams output Metrics indicator data following the industry standard. The monitoring Agent collects the indicators of the Pods, Apps, and Sidecars on each physical machine, including Metrics indicators and system service indicators (CPU, MEM, DISK, JVM, IO, etc.). The computing and cleaning cluster nodes then pull the data of the latest cycle and perform aggregation, group-by, and similar operations. Data is collected at two cycles: 5-second data and minute-level data.
For Service Mesh, the main indicators of concern are system indicators and Metrics indicators:

  • System indicators (multi-dimensional system indicators for the Pod, the App, and sidecars such as MOSN):

    • System, including CPU, LOAD, MEM, BYTES, TCP, UDP, and other information;
    • Disk, including the free space and utilization of each partition;
    • IO, including IOPS and other information;
  • Metrics indicators:

    • PROCESSOR, process resource information such as the number of fds opened by the MOSN process and the size of the virtual memory requested;
    • GO, Go runtime information of the MOSN process: goroutine count (G), thread count (M), and memstats;
    • Downstream, including the global cumulative number of downstream connections established, total bytes read, cumulative number of requests, request latency, etc.;
    • Upstream, including the number of failed upstream requests, the cumulative number of connections established in the cluster, the cumulative number of disconnections, the number of abnormal disconnections, the average upstream request latency, etc.;
    • MQ Mesh, including the total count, latency, and failures of messages sent, and the total count, latency, and failures of messages consumed;
    • Gateway Mesh, including qps, rt, rate limiting, and multi-dimensional success and failure counts;
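
To make the Prometheus parsing step more concrete, here is a minimal sketch of a parser for the Prometheus text exposition format. It is not the Agent's actual plug-in: the function name and the sample metric names are assumptions made for illustration, and the parser is deliberately simplified (it ignores timestamps and does not unescape label values).

```python
import re

# One sample per line: name{label="value",...} number
# Simplification: lines with a trailing timestamp are not handled and are skipped.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical payload: these metric names are illustrative, not MOSN's real ones.
PAYLOAD = """
# TYPE mosn_downstream_request_total counter
mosn_downstream_request_total{app="demo",idc="idc-a"} 1027
go_goroutines 42
"""

samples = parse_exposition(PAYLOAD)
```

Each parsed sample keeps its labels, so later stages can group by application, equipment room, or any other label carried in the data.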

Data calculation

The data collected by the Agent needs to be aggregated along different dimensions (LDC, IDC, APP, architectural domain, site, etc.) to meet the needs of different users and to fit Ant Financial's O&M architecture.
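
As a rough illustration of aggregating the same data along different dimensions, the sketch below sums per-instance values grouped by an arbitrary list of labels. The label names follow the dimensions mentioned above, but the function, the sample values, and the data layout are assumptions, not the platform's real interface.

```python
from collections import defaultdict


def aggregate(samples, dimensions):
    """Sum sample values grouped by the given label dimensions."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels.get(dim, "unknown") for dim in dimensions)
        totals[key] += value
    return dict(totals)


# Hypothetical per-instance request counts for one collection cycle.
SAMPLES = [
    ({"app": "pay", "idc": "idc-a", "ldc": "ldc-1"}, 120.0),
    ({"app": "pay", "idc": "idc-b", "ldc": "ldc-2"}, 80.0),
    ({"app": "trade", "idc": "idc-a", "ldc": "ldc-1"}, 50.0),
]

by_app = aggregate(SAMPLES, ["app"])             # the application owner's view
by_idc = aggregate(SAMPLES, ["idc"])             # the equipment room view
by_app_idc = aggregate(SAMPLES, ["app", "idc"])  # a combined view
```

The same raw samples serve every view, and only the list of dimensions changes.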


For a data system of this scale, our team built a unified monitoring data computing platform for Ant Financial, which aims to:

  • Use unified monitoring data standards, plug-in based data collection and access, and common data service APIs to help different monitoring products iterate quickly;
  • Establish a sound data quality system and a high-availability computing cluster to ensure the quality of monitoring data;
  • Provide rich and open data analysis capabilities through SQL-like task definitions, custom calculation tasks, and plug-ins, to meet the complex data analysis needs of the technical risk business areas;


The key components for scheduling and executing computing tasks (on Spark) are GS (Global-Scheduler, global graph scheduling) and CS (Compute-Space, the computing space).

GS is the task scheduling center of the platform. As shown in the following figure, it collects the data source configuration of all businesses and builds a global computing task topology model (GlobalGraph) from the calculation relationships between the data sources. According to different task execution strategies, the global task topology graph is cut into small-scale task topologies (Graph). The main features are:

  • GS distributes each Graph to a computing space (CSpace) for calculation according to strategies such as task priority, resource quality, and load;
  • Data dependencies within the same Graph are resolved directly in the calculation process;
  • Data dependencies between different Graphs are decoupled through storage;
  • GS manages the task status of all Graphs and computing nodes, and controls when each Graph runs according to the Graph's dependencies and execution status;
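
The idea of cutting a global topology into smaller Graphs can be sketched as follows. The article does not describe GS's actual cutting strategies, so this example assumes the simplest possible one: each connected component of the dependency graph becomes one Graph, and nodes inside a Graph are ordered so that inputs are computed before outputs. All names here are hypothetical.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def split_into_graphs(dependencies):
    """Split a global dependency map into independent sub-graphs.

    dependencies maps each data source to the set of sources it is computed from.
    Returns one execution order (a list of node names) per connected component.
    """
    # Build an undirected adjacency view to find connected components.
    neighbours = defaultdict(set)
    for node, parents in dependencies.items():
        neighbours[node]  # touching the key registers nodes that have no parents
        for parent in parents:
            neighbours[node].add(parent)
            neighbours[parent].add(node)

    seen, graphs = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        # Order each sub-graph so that inputs are computed before outputs.
        sub = {n: dependencies.get(n, set()) for n in component}
        graphs.append(list(TopologicalSorter(sub).static_order()))
    return graphs


# Hypothetical data sources and their inputs.
GLOBAL_GRAPH = {
    "app_qps": {"raw_rpc"},
    "idc_qps": {"app_qps"},
    "disk_usage": {"raw_disk"},
}
graphs = split_into_graphs(GLOBAL_GRAPH)
```

Under this assumed strategy, no data dependency crosses two Graphs. A real scheduler that also cuts inside a component would need the storage-based decoupling described in the list above.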


CS is the abstract execution space for computing tasks on the platform. As shown in the following figure, it is mainly responsible for parsing a Graph and for submitting and executing the concrete computing tasks. It can adapt to different computing engines, such as Spark or Flink. Taking Spark as an example, CS receives a GraphTask from GS, parses each Node (Transform) in the GraphTask into Spark Transformation and Action operators to form a calculation DAG, and submits it to the Spark cluster for execution.

During task execution, CS will synchronize the execution status of each task to GS for task tracking and monitoring.


Multiple CSpaces form a CSpaceGroup. CSpaces can be divided into different calculation groups according to scenarios such as load balancing, resource level, and blue-green release. Switching tasks between CSpaces meets high availability requirements such as load balancing, resource isolation, blue-green release, and grayscale release.

Scaling issues

For Ant Financial's large-scale Service Mesh cluster data, products cannot get query results in real time through PromQL, and alarms cannot be sent in time. We therefore classify the monitoring data and pre-compute aggregations along dimensions such as application, equipment room, and site, for example QPS per equipment room, RPC forwarding success count, and error count. The front end then lets users configure custom dashboards for the data they care about.
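
The pre-calculation idea can be sketched as a rollup: raw 5-second points are folded into a small per-minute table ahead of time, and dashboard queries read that table instead of scanning raw data. The function names, the tuple layout, and the sample values below are assumptions for illustration only.

```python
from collections import defaultdict


def rollup_per_minute(points):
    """Pre-aggregate (timestamp_seconds, idc, request_count) points into per-minute totals."""
    buckets = defaultdict(int)
    for ts, idc, count in points:
        minute_start = ts - ts % 60
        buckets[(minute_start, idc)] += count
    return dict(buckets)


def average_qps(rollup, minute_start, idc):
    """Answer a dashboard query from the pre-computed table instead of raw data."""
    return rollup.get((minute_start, idc), 0) / 60


# Hypothetical 5-second request counts per equipment room.
POINTS = [
    (0, "idc-a", 500), (5, "idc-a", 700), (60, "idc-a", 600),
    (0, "idc-b", 300),
]
ROLLUP = rollup_per_minute(POINTS)
```

The query cost now depends on the number of minutes and equipment rooms requested, not on the number of instances reporting.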

This year, the number of MOSN containers grew to hundreds of thousands. With frequent releases and deployments, containers going online and offline raise higher requirements for how quickly monitoring reflects changes. The Meta metadata module connects to the K8s cluster and deploys a monitoring operator that watches container status changes. The latest collection configuration is then pushed to the Agent module through the Agent registry within seconds.
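
The flow from container status changes to collection configuration might look roughly like the sketch below. It is an in-memory stand-in rather than the real operator or registry: the class, the event dictionary layout, and the URLs are hypothetical. Only the event type names (ADDED, MODIFIED, DELETED) follow the Kubernetes watch API.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRegistry:
    """In-memory stand-in for the Agent registry: host -> {pod name -> collection target}."""
    targets: dict = field(default_factory=dict)

    def apply(self, event):
        """Apply one container status change event and return the host's current targets."""
        pods = self.targets.setdefault(event["host"], {})
        if event["type"] in ("ADDED", "MODIFIED"):
            pods[event["pod"]] = event["metrics_url"]
        elif event["type"] == "DELETED":
            pods.pop(event["pod"], None)
        return pods


# Hypothetical events, as a monitoring operator might forward them.
registry = AgentRegistry()
registry.apply({"type": "ADDED", "host": "host-1", "pod": "pay-0",
                "metrics_url": "http://10.0.0.1:9100/metrics"})
registry.apply({"type": "ADDED", "host": "host-1", "pod": "pay-1",
                "metrics_url": "http://10.0.0.2:9100/metrics"})
registry.apply({"type": "DELETED", "host": "host-1", "pod": "pay-0"})
```

After these three events, the Agent on host-1 would only be told to collect from pay-1.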


Big promotion guarantee

On the one hand, we guarantee the high availability of monitoring itself and make the collection and computing layers able to scale out and in. On the other hand, we evaluate capacity and grade tasks by priority: high-priority tasks are guaranteed, and low-priority tasks are degraded after communicating with the business teams. In this way, core data remains stable even when monitoring computing resources are tight.
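
A minimal sketch of priority-based degradation is shown below. The article does not give the real grading scheme, so the strict rule used here (once one task no longer fits, every lower-priority task is degraded too), the cost units, and the task names are all assumptions.

```python
def plan_tasks(tasks, capacity):
    """Keep tasks in priority order until capacity runs out; degrade the rest.

    tasks: list of (name, priority, cost); a smaller priority number is more important.
    """
    kept, degraded, used = [], [], 0
    for name, priority, cost in sorted(tasks, key=lambda t: t[1]):
        # Strict priority: after the first task that does not fit, nothing else is kept.
        if not degraded and used + cost <= capacity:
            kept.append(name)
            used += cost
        else:
            degraded.append(name)
    return kept, degraded


# Hypothetical computing tasks with made-up costs.
TASKS = [
    ("weekly_report", 3, 20),
    ("core_rpc_success_rate", 0, 40),
    ("per_instance_debug_metrics", 2, 50),
    ("mosn_error_archive", 1, 30),
]
kept, degraded = plan_tasks(TASKS, capacity=100)
```

With a capacity of 100, the two most important tasks are kept and the remaining two become candidates for degradation.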


Product perspective

Service Mesh is the basic service facility used by Ant Financial's internal application services, and different users look at it from different perspectives. For monitoring products, usage is concentrated on three levels of data: "distribution, viewing, use". We did a similar user analysis earlier. At Ant Financial, users are divided by how they use the product into global followers, product owners, SREs, domain experts, and ordinary users. The monitoring products provide different views of Service Mesh to meet these different needs. For example:

  • Product Owner perspective: the developers of the MOSN product, who are mainly responsible for MOSN's monitoring data coverage, data accuracy, and key optimization goals;
  • Ordinary user perspective: the application Owner, who mainly looks at the impact of the MOSN service on the application's RPC calls and at the efficiency gained by using MOSN;
  • SRE perspective: SREs take a global view, need to know the stability of all MOSN services, and pay more attention to early warning and analysis;
  • Domain expert perspective: users of in-depth monitoring data, such as detailed JVM, CPU, and Go indicators, and deeper perf and jfr analysis;
  • Global perspective: architects or others who follow the whole site, focusing on application services site-wide;

Application Owner

Application owners look forward to this new service but are also nervous. They look forward to the new features and services MOSN can bring, yet they worry that the new service adds another layer of dependency and stability issues. For the product, beyond making the data observable, the focus is therefore on observing MOSN core indicators and archiving MOSN error data. Alarm capabilities are adapted at the same time so that the application owner knows where the problem is.

Because MOSN is deployed in the same pod as the application container, application owners worry about resource preemption. In the end this has to be verified by data, so comparing water level (resource utilization) data is indispensable.


MOSN Product Expert

MOSN product technical experts are confident in their new service, but they need to check the overall performance indicators of their product and tune it. So from the start, the monitoring product worked with the MOSN service to complete data coverage and accuracy verification from offline to online, followed by global observation and comparison of core indicators.

During the MOSN rollout, we mostly dealt with MOSN technical experts. Dashboards such as the MOSN dashboard already show data aggregated at the application dimension, but for troubleshooting errors, a global top N of per-instance system indicators (cpu, mem, load) is more meaningful and helps quickly find instances with abnormal CPU or MEM.
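
The top N view can be sketched with a heap-based selection, which avoids sorting every instance when only a few are needed. The instance records and field names below are made up for illustration.

```python
import heapq


def top_n(instances, metric, n=3):
    """Return the n instances with the highest value for the given metric."""
    return heapq.nlargest(n, instances, key=lambda inst: inst[metric])


# Hypothetical per-instance system indicators, in percent.
INSTANCES = [
    {"ip": "10.0.0.1", "cpu": 35.0, "mem": 60.0},
    {"ip": "10.0.0.2", "cpu": 92.5, "mem": 40.0},
    {"ip": "10.0.0.3", "cpu": 15.0, "mem": 88.0},
    {"ip": "10.0.0.4", "cpu": 71.0, "mem": 55.0},
]
hot_cpu = [inst["ip"] for inst in top_n(INSTANCES, "cpu", n=2)]
hot_mem = [inst["ip"] for inst in top_n(INSTANCES, "mem", n=2)]
```

Running the same selection for cpu, mem, and load gives a short list of instances worth inspecting first.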


SRE expert

SRE experts are always worried about launching new products, especially since this year's MOSN rollout at Ant Financial is so large, so sufficient data is needed to verify that it meets the launch standard. Monitoring has to provide that data, especially site-wide data. For this reason, we provide a dedicated dashboard for core application services that shows the rt of MOSN in core applications, the number of errors reported, and the water level of the top instances during stress tests.


Global Architect

Global observers naturally pay attention to the core indicators. While understanding the SRE stability solution, they also pay attention to the performance improvement brought by MOSN services overall, such as the service forwarding success rate and MOSN rt.

In addition to the above basic product capabilities, we are also trying to continue to improve the product from the perspective of data, functions, and experience.

Future thinking

Ant Financial's monitoring products aim to become full-stack monitoring for the cloud-native era. From application to infrastructure, from cloud to edge to device, all monitoring data in the technical risk field should be transparent and observable in one place. Internally, the products will support business scenarios across the technical risk field, including emergency response, capacity, rate limiting, security, change, and big promotions. Externally, they will support technology output, cloud products, international empowerment, and commercialization.

The key follow-up direction is Monitoring as a Service. It aims to let business R&D and SRE engineers complete monitoring data collection, data aggregation, early warning rule configuration, and CMS report and dashboard display through code, making monitoring for business scenarios more convenient, flexible, and creative, and opening up more possibilities in the monitoring field.

Finally, we also welcome like-minded partners to join us and participate in the design and innovation of financial-grade monitoring system architecture.

About us

Welcome to the world of "ant intelligent operation and maintenance". This public account is produced by the Ant Intelligent Monitoring Team and is aimed at readers interested in intelligent operation and maintenance technology. From time to time, we will share the thinking and practice of Ant Financial in the design and innovation of intelligent monitoring architecture in the cloud native era.

The Ant Intelligent Monitoring Team is responsible for the monitoring needs of Ant Financial's infrastructure and business applications. It is working to build a one-stop, integrated monitoring product with intelligent AIOps capabilities for clusters of millions of machines and hundreds of millions of service invocations. The product covers indicators, logs, performance, and link (trace) data, with functions including collection, cleaning, calculation, storage, dashboard display, offline analysis, alarm coverage, and root cause location, and it serves many of Ant Financial's businesses and scenarios.

Regarding "Smart Operation and Maintenance", if you want to discuss any topic, please leave us a message.

PS: Ant Intelligent Monitoring is recruiting AIOps experts. You are welcome to join us; if interested, contact