Demystifying Tencent Meeting at Ten-Thousand-Pod Scale: Full Cloud-Native Practice on TKE
Posted Jun 16, 2020 • 16 min read
Tencent Meeting, an online meeting solution selected by the United Nations, provides high-quality meetings and flexible collaboration space, and is widely used in government, healthcare, education, and enterprise settings. Readers of the article "Expanding 1 million cores in 8 days, how did the Tencent conference do it?" already know that the computing resources behind Tencent Meeting have exceeded one million cores. At that scale, how to improve R&D and operations efficiency through cloud-native technology is a very valuable subject. This article describes the technology behind TKEx, Tencent's self-developed cloud container platform, which supports Tencent Meeting's full move to cloud native.
TKEx is a container platform, built on Tencent Kubernetes Engine (TKE), that serves Tencent's self-developed businesses. These businesses are many in type and large in scale, so the challenges of moving them to cloud native are easy to imagine. The best practices and solutions that TKEx develops while moving Tencent's self-developed businesses to the cloud will be made available to customers in TKE.
Tencent Meeting business features
In Kubernetes, we usually classify applications as stateless or stateful. A stateful application is mainly one whose instance identity, network, or storage carries state. Some Tencent Meeting services have the following characteristics:
- They use IPC shared memory, and the state data stored in it ranges from MB to GB.
  - The IPC data cannot be lost during an upgrade.
  - Only ms-level jitter is allowed during an upgrade, so that users notice nothing.
- The largest services have tens of thousands of instances, so a version upgrade must complete efficiently.
- They are deployed globally across multiple regions, which requires efficient deployment.
- Some services require an EIP to be assigned to each instance.
These characteristics place higher capability and performance demands on Kubernetes when managing such stateful services. The TKEx platform abstracted the product requirements behind these business features and made enhancements and optimizations in grayscale release, multi-cluster workload management, computing resource management and operation, and Node stability, resulting in a general-purpose container orchestration capability for audio and video services.
StatefulSetPlus' powerful grayscale publishing capabilities
StatefulSetPlus is one of the first operators we developed; it went into production in 2018. Its core features include:
- Compatible with all StatefulSet features, such as ordered rolling updates.
- Supports batched grayscale update and rollback. Pods within a single batch can be updated concurrently or serially.
  - Supports manually selecting the Pods to upgrade in each batch.
  - Supports configuring the proportion of Pods to upgrade in each batch.
  - Supports batch rollback and one-click rollback.
  - The release process can be paused.
- Supports managing tens of thousands of Pods with a single StatefulSetPlus object.
- Supports batched grayscale release of ConfigMaps.
- Integrates with TKE IPAMD to give Pods fixed IPs.
- Supports HPA and in-place VPA.
- Uses LastGoodVersion for scale-out during an upgrade.
- Supports self-checking of core Node state; Pods can drift automatically when a Node is abnormal.
- Supports in-place upgrade of containers.
- Supports a tolerance rate for failed Pods during an upgrade: in a large-scale upgrade, if failed Pods account for less than x% of the total, the upgrade can continue.
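The failure tolerance check in the last item can be illustrated with a minimal sketch. The function name and parameters below are hypothetical and are not taken from the StatefulSetPlus source; the sketch only assumes the rule stated above, that an upgrade may continue while failed Pods stay below x% of the total.

```python
def can_continue_upgrade(failed_pods: int, total_pods: int,
                         tolerance_percent: float) -> bool:
    """Return True while the share of failed Pods is below the tolerance.

    Hypothetical helper: tolerance_percent plays the role of the "x%"
    in the feature list; it is not a real StatefulSetPlus field name.
    """
    if total_pods <= 0:
        raise ValueError("total_pods must be positive")
    # Compare without dividing to avoid rounding surprises.
    return failed_pods * 100 < tolerance_percent * total_pods
```

With a 1% tolerance on a 2,000-Pod batch, for instance, 15 failed Pods would let the upgrade proceed, while 20 would not.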
Here we focus on two new release capabilities that TKE gained for Tencent Meeting: large-scale automatic batched grayscale release, and batched grayscale release of ConfigMaps.
Automatic batch release for tens of thousands of Pods in a single StatefulSetPlus
Building on StatefulSetPlus's existing manual batch release capability, the TKEx platform has now added automatic batch release to address the pain points of large-scale grayscale release for services such as Tencent Meeting. At release time, users only need to configure the percentage of replicas to update in each batch, such as 40% in the first batch and 60% in the second. The StatefulSetPlus-Operator then automatically starts the next batch once the Readiness probes of the current batch have passed. The principle is as follows.
The core StatefulSetPlus fields are as follows:
- batchNum: the number of batches in the upgrade
- batchAuto: whether to release batches automatically; true means automatic batch release
- batchIntervalMinutes: the number of minutes between batch releases
- podsNumToUpdate: the number of Pods released in each batch; if not set, the Pods are distributed evenly across the batches
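To make the interaction between batchNum and podsNumToUpdate concrete, here is a small sketch of how batch boundaries could be derived. It is an illustration only: the function, its argument names, the ascending index order, and the way a remainder is spread over the first batches are all assumptions, not the operator's actual logic.

```python
from typing import List, Optional, Tuple


def plan_batches(replicas: int, batch_num: int,
                 pods_num_to_update: Optional[List[int]] = None
                 ) -> List[Tuple[int, int]]:
    """Return one (start_index, batch_size) pair per batch.

    If pods_num_to_update is not given, Pods are spread evenly across
    the batches (assumption: any remainder goes to the earliest batches).
    """
    if batch_num <= 0:
        raise ValueError("batch_num must be positive")
    if pods_num_to_update is None:
        base, extra = divmod(replicas, batch_num)
        sizes = [base + (1 if i < extra else 0) for i in range(batch_num)]
    else:
        if len(pods_num_to_update) != batch_num:
            raise ValueError("need one size per batch")
        if sum(pods_num_to_update) != replicas:
            raise ValueError("batch sizes must add up to replicas")
        sizes = list(pods_num_to_update)

    plan = []
    start = 0
    for size in sizes:
        plan.append((start, size))  # start plays the role of batchOrdinal
        start += size
    return plan
```

For example, `plan_batches(10, 2, [4, 6])` corresponds to the 40%/60% split mentioned above, and `plan_batches(10000, 5)` yields five batches of 2,000 Pods.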
StatefulSetPlus monitors the release process at a fine granularity and exposes status.batchDeployStatus for querying detailed release status, which makes releases through a CI pipeline easier to display and control.
- action: the current operation.
  - Next means to release the next batch.
  - WaitToConfirm means to wait for confirmation that the batch was released successfully.
  - Completed means that all batches have been confirmed as successfully released.
- batchDeadlineTime: the deadline for this batch; if the Pods in this batch are still not Running and Ready after this time, the batch release fails and automatic rollback begins
- batchOrder: the current batch
- batchOrdinal: the starting Pod index for this batch
- batchReplicas: the number of Pods released in this batch
- currentDeployComplete: whether the release of this batch is complete
- currentOrderSuccessPer: the percentage of Pods successfully upgraded
- currentOrderProgress: whether this batch was released successfully
- currentRollbackProgress: whether the rollback of this batch succeeded
- generalStatus: the overall status of this release
You can add the annotation platform.tkex/pause-auto-batchDeploy:"true" to pause automatic batch release and automatic rollback on failure.
On the TKEx platform, automatic batch release can be easily completed through the following operation process.
The largest Tencent Meeting module needs grayscale release across tens of thousands of Pods, which is an unprecedented challenge. We optimized the StatefulSetPlus-Operator for this and improved its performance substantially. For a StatefulSetPlus of 10,000 Pods upgraded automatically in 5 batches of 2,000 Pods each, with no CBS disks mounted, the performance is as follows:
- Non-in-place upgrade: processing a single batch takes 40-45 seconds, a single batch takes three and a half minutes from initiation to completion, and a single StatefulSetPlus status sync takes about 10 seconds during the upgrade.
- In-place upgrade: processing a single batch takes about 30 seconds, a single batch takes about one minute and ten seconds from initiation to completion, and a single StatefulSetPlus status sync takes about 10 seconds during the upgrade.
- Under normal conditions (no upgrade in progress), a StatefulSetPlus status sync takes milliseconds.
Support batch grayscale release and version management of ConfigMap
A native Kubernetes ConfigMap update replaces the corresponding configuration file in every container at once, so updating configuration files the native way is extremely risky. Kubernetes 1.18 supports immutable ConfigMaps and Secrets, which can protect key configuration from erroneous changes that affect the business. The business also has strong demand for releasing configuration files in grayscale batches.
So we gave StatefulSetPlus the ability to release configuration files in batches, which improves the safety of configuration releases in cloud-native scenarios. The principle is as follows:
- When the user submits a modified ConfigMap, a new ConfigMap is automatically created in the background. Its name carries the hash of the data content as a suffix, which prevents identical data from creating multiple ConfigMaps. The real ConfigMap name, without the hash suffix, is added as a label, together with a version label; the business may also define custom labels to identify the ConfigMap version.
- For Pod modifications, Kubernetes only supports updating the fields spec.containers[*].image, spec.containers[*].resources (if the in-place resources update feature is enabled), spec.initContainers[*].image, spec.activeDeadlineSeconds, and spec.tolerations (only additions to existing tolerations), so the kube-apiserver code needs to be modified to allow update/patch of volumes.
- Through StatefulSetPlus's batched grayscale release capability, the ConfigMap referenced by the Pods is changed batch by batch. The kubelet volume manager automatically reloads the ConfigMap, so updating the ConfigMap does not require rebuilding the Pods.
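The content-hashed naming in the first step can be sketched as follows. This is only an illustration of the idea: the hash algorithm, the suffix length, and the label keys (example.tkex/configmap-name, example.tkex/version) are placeholders chosen for the example, not the names TKEx actually uses.

```python
import hashlib
import json
from typing import Dict


def versioned_configmap(name: str, data: Dict[str, str], version: str) -> dict:
    """Build a ConfigMap manifest whose name ends in a hash of its data.

    Identical data always maps to the same name, so resubmitting the
    same content does not create a second object.
    """
    # Serialize with sorted keys so the hash does not depend on key order.
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:10]
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"{name}-{digest}",
            "labels": {
                # Hypothetical label keys, for illustration only.
                "example.tkex/configmap-name": name,
                "example.tkex/version": version,
            },
        },
        "data": dict(data),
    }
```

Because the suffix depends only on the data, the label carrying the original name is what lets a controller find all versions of the same logical ConfigMap.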
To keep ConfigMaps from accumulating and degrading etcd cluster performance, we added ConfigMap recycling logic to the self-developed TKEx-GC-Controller component, which keeps only the latest 10 versions of a ConfigMap.
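A retention rule of this kind can be sketched in a few lines. The function below is hypothetical and not taken from TKEx-GC-Controller; in particular, skipping versions that are still referenced by Pods (the in_use argument) is an assumption added for safety in the example, since the article does not say how in-use versions are treated.

```python
from typing import Iterable, List, Tuple


def configmaps_to_delete(versions: List[Tuple[str, float]],
                         keep: int = 10,
                         in_use: Iterable[str] = ()) -> List[str]:
    """Return names of ConfigMap versions that fall outside the newest `keep`.

    versions: (configmap_name, creation_timestamp) pairs for ONE logical
    ConfigMap. Names listed in in_use are never returned (assumption).
    """
    protected = set(in_use)
    newest_first = sorted(versions, key=lambda v: v[1], reverse=True)
    return [name for name, _ in newest_first[keep:] if name not in protected]
```

Given twelve versions, the two oldest are returned for deletion unless one of them is still mounted.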
Users only need to open the Workload update page, choose manual or automatic batch update, and select the new ConfigMap version in the data volume option. The ConfigMap configuration file can be updated together with the business image, or on its own.
A ConfigMap configuration file update requires the business process in the container to watch the file and either restart or hot-reload. Some services do not yet have this capability, so TKEx provides a ProUpdate Hook at the ConfigMap release entry point that runs after the configuration file is updated, for example the cold or hot restart command of the business process.
How to ensure that the upgrade of stateful services has only ms-level jitter
Rejecting the fat-container model (using containers as virtual machines) is a principle of the TKEx platform. Releasing via images while keeping business jitter at the ms level, as with a process restart, is one of the most challenging requirements in containerizing Tencent Meeting on the cloud. The TKEx platform has long experience in grayscale release, with tens of thousands of business modules in use, but its existing capabilities could not meet this demand. Even image pre-loading combined with in-place container upgrade falls far short of the goal.
After designing, analyzing, and comparing tests of several solutions, and weighing generality, cloud-native fit, and release efficiency, we ultimately adopted the following solution:
There are three key containers in the Pod, with the following responsibilities:
- biz-sidecar: The sidecar container has a simple responsibility: detecting whether the Pod is being upgraded. Its Readiness Probe compares the contents of the business release version files version1 and version2 in the EmptyDir volume. If they are equal the Pod is Ready; otherwise it is notReady.
- biz-container: The container startup script writes the version number from a (pre-injected) environment variable to its versionX file, then loops waiting for the file lock. Once it acquires the file lock, it starts the business process. The file lock is the key to preventing the Pod from running multiple versions of the business container at the same time: containers of different versions exclude each other through it.
- biz-pause: The startup script writes the version number from the environment variable to its versionX file and then sleeps indefinitely. This is the standby container. During a business upgrade, an in-place upgrade switches it into the biz-container role.
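The mutual exclusion that the file lock provides can be demonstrated with a short sketch. The real startup scripts are not shown in the article, so everything here is an assumption: the lock file name, the use of flock-style advisory locks, and the choice of Python (the actual scripts may well be shell). The sketch runs on Unix-like systems only.

```python
import fcntl
import os
import tempfile


def try_acquire(lock_path: str):
    """Try to take an exclusive lock without blocking.

    Returns the open file object holding the lock, or None if another
    holder already has it.
    """
    f = open(lock_path, "a")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        f.close()
        return None
    return f


def release(f) -> None:
    """Drop the lock and close the file."""
    fcntl.flock(f, fcntl.LOCK_UN)
    f.close()


def demo():
    """Simulate V1 holding the lock while V2 waits, then V1 exiting."""
    with tempfile.TemporaryDirectory() as d:
        lock_path = os.path.join(d, "biz.lock")  # hypothetical file name
        v1 = try_acquire(lock_path)          # old version is running
        v2_early = try_acquire(lock_path)    # new version must wait
        release(v1)                          # old container is replaced
        v2_late = try_acquire(lock_path)     # new version can now start
        outcome = (v1 is not None, v2_early is not None, v2_late is not None)
        if v2_late is not None:
            release(v2_late)
        return outcome
```

In a real startup script the second attempt would sit in a retry loop, so the new business process starts as soon as the old one releases the lock, which is what keeps the switch-over short.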
Overview of upgrade process
Taking the upgrade of the service container image from version V1 to version V2 as an example, the upgrade process is described as follows:
- The user deploys the service for the first time, as in the left-most Pod above, with 3 containers in total: biz-sidecar, biz-container (environment variable version number 1), and biz-pause (environment variable version number 1). After both of the latter containers have started, the contents of the version1 and version2 files are both 1, and biz-sidecar is Ready.
- The biz-pause container of the Pod is updated to the V2 business image with environment variable version number 2. After this container is upgraded in place, the content of the version2 file becomes 2 and it starts waiting for the file lock. At this point the biz-sidecar probe turns notReady.
- When the StatefulSetPlus-Operator watches biz-sidecar become notReady, it replaces the previous V1 business image with the biz-pause image, with environment variable version number 2. After the pause-image container restarts in place, the file lock held by the previous V1 business container is released and the content of version1 becomes 2. At this point the sidecar probe is Ready again, and the upgrade is complete.
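The three steps above boil down to a Ready, notReady, Ready sequence driven by the two version files. The following sketch models the files as a dictionary and replays the steps; it is a simplified model for illustration, not the sidecar's actual probe code.

```python
from typing import Dict, List


def sidecar_ready(files: Dict[str, str]) -> bool:
    """Readiness rule from the text: Ready only if both version files match."""
    return files.get("version1") == files.get("version2")


def simulate_upgrade(old: str, new: str) -> List[bool]:
    """Replay the upgrade steps and record the sidecar readiness after each."""
    files = {"version1": old, "version2": old}
    readiness = [sidecar_ready(files)]      # step 1: initial deployment

    files["version2"] = new                 # step 2: standby slot gets V2
    readiness.append(sidecar_ready(files))

    files["version1"] = new                 # step 3: old slot becomes pause
    readiness.append(sidecar_ready(files))
    return readiness
```

The notReady window in the middle is what the operator watches for as the signal to perform the second in-place replacement.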
The following two points need to be explained:
- The native Kubernetes apiserver only allows modification of a few Pod fields such as image, and does not support modifying resources, environment variables, and so on, so this solution requires changes to the relevant K8s apiserver code.
- In addition, to keep the Pod-level resources and Pod QoS unchanged, the StatefulSetPlus-Operator must adjust container resources as container states change during the upgrade.
Multi-region deployment and upgrades become easier
In terms of multi-region service management, we mainly address two demands:
- The same service needs to be deployed in many regions to provide nearby access or multi-region disaster recovery. How can the service be replicated quickly across multiple clusters?
- How can the same service deployed in multiple regions be upgraded quickly?
TKEx provides convenient multi-region, multi-cluster deployment and synchronized upgrade capabilities.
- Supports deployment to multiple regions and multiple clusters at once.
- Supports simultaneous upgrade of a Workload deployed in multiple clusters.
Enhanced platform resource management capabilities
The cluster resources of the TKEx platform are shared by all services, and different services are co-located within clusters and nodes. Each product has its own resource budget. The platform accepts each product's budget and automatically generates a corresponding resource quota to control that product's Quota across the whole platform. Because product deployment involves cost accounting, the platform meters actual resource usage at hourly granularity, and tracks and aggregates the resource usage of each Workload under each business product.
Kubernetes natively uses ResourceQuota to limit resources, but it falls short of our expectations in the following ways:
- ResourceQuota is Namespace-based and cannot enforce limits per product.
- ResourceQuota applies within a single cluster; it cannot work at platform level or rebalance across multiple clusters.
- It can only cap usage; it cannot guarantee that the business has sufficient resources available.
Based on our management needs and expectations for business products, TKEx's quota management system must have the following characteristics:
- Simple to use. Users do not need to care about underlying details; how the quota is distributed and allocated among clusters is handled automatically by the system.
- The quota allocated to a product must guarantee that the product always has that many resources available.
- To meet the platform's needs in online/offline hybrid deployment scenarios, the quota system must be able to limit the quota of offline tasks.
- To avoid wasting platform resources when a product holds quota without using it, products must be able to borrow and return quota among themselves.
We have designed a DynamicQuota CRD to manage the Quota of each business product in the cluster to achieve the above capabilities.
- Quota Rebalance Worker: Periodically redistributes a product's quota among clusters based on the product's quota usage in each cluster. For example, suppose a product's service is configured with elastic scaling. When the product runs out of quota in one cluster because of scale-out but still has spare quota in other clusters, the Worker moves quota from the idle clusters to that cluster.
- DynamicQuota Operator: Maintains the state of the DynamicQuota custom resource, collects each product's usage in the cluster, and exposes it to Prometheus.
- DynamicQuota ValidatingWebhook: Intercepts all Pod creation requests to the kube-apiserver in the cluster and rejects those from products that are over quota.
- OfflineTask QueueManager: Consumes the offline job queue (ActiveQ) in order of job priority and checks whether the share of offline job resources in each cluster exceeds the watermark, thereby controlling the overall share of offline job resources and preventing offline jobs from consuming excessive cluster resources.
- Pod-resource-compressor and VPA components: Based on the actual load and resource allocation of clusters and nodes, these compress offline resources and resize in place to protect the resource usage of online tasks. For online/offline co-location, we also optimized CPU scheduling at the kernel level so that offline tasks yield quickly, ensuring quality of service for online tasks.
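The rebalancing and admission ideas in the first and third components can be sketched together. This is a deliberately simplified model: quotas are plain CPU-core integers per cluster for a single product, and the function names, the donor ordering, and the data layout are assumptions for illustration rather than the DynamicQuota CRD's real schema or algorithm.

```python
from typing import Dict


def rebalance(quota: Dict[str, int], used: Dict[str, int],
              needy: str, amount: int) -> Dict[str, int]:
    """Move up to `amount` cores of unused quota to the `needy` cluster.

    Donors are visited from most idle to least idle (assumption). The
    product's total quota across clusters is preserved.
    """
    result = dict(quota)
    remaining = amount
    donors = sorted((c for c in quota if c != needy),
                    key=lambda c: quota[c] - used.get(c, 0), reverse=True)
    for cluster in donors:
        if remaining == 0:
            break
        idle = max(result[cluster] - used.get(cluster, 0), 0)
        take = min(idle, remaining)
        result[cluster] -= take
        result[needy] += take
        remaining -= take
    return result


def admit(quota: Dict[str, int], used: Dict[str, int],
          cluster: str, requested: int) -> bool:
    """Webhook-style check: allow a Pod only if it fits the product quota."""
    return used.get(cluster, 0) + requested <= quota.get(cluster, 0)
```

In this model a Pod creation that `admit` rejects in an exhausted cluster would be accepted after `rebalance` has shifted idle quota there, which mirrors the scale-out example in the first bullet.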
Budget transfer automatically generates product Quota
After a product's budget is transferred to the TKEx platform, the platform automatically increases the product's quota and modifies its DynamicQuota accordingly. Users can view the resource quota of their product in the TKEx monitoring panel.
Business accounting automation and visualization
TKEx performs cost accounting by metering the resources a business uses in core*time. Users can view the detailed resource usage of each Kubernetes Workload in the TKEx monitoring panel.
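As a worked example of core*time metering, the sketch below sums usage samples into core-hours. The sample format and the choice of hours as the time unit are assumptions made for the example (the article mentions hourly granularity elsewhere); it does not reflect TKEx's actual billing pipeline.

```python
from typing import Iterable, Tuple


def core_hours(samples: Iterable[Tuple[float, float]]) -> float:
    """Sum (cores_in_use, duration_hours) samples into core*hours."""
    return sum(cores * hours for cores, hours in samples)
```

A Workload that uses 4 cores for 1 hour, then 8 cores for 2 hours, then 2 cores for half an hour accrues 4 + 16 + 1 = 21 core-hours.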
Improve self-healing ability
As cluster size and node deployment density keep growing, the average cluster load exceeds 50% and the load on many nodes exceeds 80% at peak, and some stability problems have begun to appear. TKEx has therefore made the following node stability optimizations:
- Actively probe the availability of dockerd and restart it when it is abnormal, to prevent a hung dockerd from causing the Pods on the Node to be destroyed and rebuilt.
- Actively monitor the availability of kubelet and restart it when it is abnormal, to prevent a hung kubelet from causing large numbers of Pods to drift and be rebuilt.
- Kubernetes isolates kernel parameters such as pids.max and file-max poorly. Although Kubernetes 1.14 supports limiting the number of PIDs in a Pod, in practice it is hard to choose a default PID limit for businesses. Within a cluster, exhausting a node's PIDs or file-max can still affect other business containers on the same node. To address this, we added monitoring of node PIDs and file-max to the Node-Problem-Detector (NPD) component. When usage of these resources reaches the watermark, NPD automatically identifies and reports the Container consuming the most PIDs and file-max, triggers an alarm, and restarts the Container in place.
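The "find the heaviest consumer once the watermark is reached" logic in the last item might look like the sketch below. It is a simplified model with hypothetical names: per-container PID counts are passed in as a dictionary, the node total is approximated by their sum, and the 90% default watermark is an assumption for the example.

```python
from typing import Dict, Optional


def top_pid_consumer(container_pids: Dict[str, int], pid_max: int,
                     watermark: float = 0.9) -> Optional[str]:
    """Return the container using the most PIDs once usage hits the watermark.

    Returns None while total usage is below watermark * pid_max.
    """
    if pid_max <= 0:
        raise ValueError("pid_max must be positive")
    total = sum(container_pids.values())
    if total < watermark * pid_max:
        return None
    # The heaviest consumer is the candidate for alarm and in-place restart.
    return max(container_pids, key=container_pids.get)
```

The same shape of check would apply to file descriptors against file-max.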
These capabilities are implemented in the NPD component, which monitors the working state of the node, including kernel deadlocks, OOM frequency, system thread count pressure, system file descriptor pressure, and other indicators. It takes actions such as evicting and tainting the node, and reports to the apiserver in the form of Node Conditions or Events.
The current NPD component will add the following specific Conditions to the node:
|Condition|Default|Description|
|---|---|---|
|ReadonlyFilesystem|False|Whether the file system is read-only|
|FDPressure|False|Whether the number of file descriptors on the host has reached 80% of the maximum|
|PIDPressure|False|Whether the host has consumed more than 90% of its PIDs|
|FrequentKubeletRestart|False|Whether kubelet has restarted more than 5 times within 20 minutes|
|CorruptDockerOverlay2|False|Whether there is a problem with Docker images|
|KubeletProblem|False|Whether the kubelet service is Running|
|KernelDeadlock|False|Whether the kernel has a deadlock|
|FrequentDockerRestart|False|Whether Docker has restarted more than 5 times within 20 minutes|
|FrequentContainerdRestart|False|Whether Containerd has restarted more than 5 times within 20 minutes|
|DockerdProblem|False|Whether the Docker service is Running (always False if the node's runtime is Containerd)|
|ContainerdProblem|False|Whether the Containerd service is Running (always False if the node's runtime is Docker)|
|ThreadPressure|False|Whether the current number of threads in the system has reached 90% of the maximum|
|NetworkUnavailable|False|Whether the NTP service is Running|
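Several of these conditions share the "more than 5 restarts within 20 minutes" rule. A sliding-window check of that kind can be sketched as below; the function and parameter names are hypothetical and the sketch is not taken from NPD's source.

```python
from typing import Iterable


def frequent_restart(restart_times: Iterable[float], now: float,
                     window_seconds: float = 20 * 60,
                     threshold: int = 5) -> bool:
    """Return True if more than `threshold` restarts fall inside the window.

    restart_times and now are Unix timestamps in seconds.
    """
    recent = [t for t in restart_times if 0 <= now - t <= window_seconds]
    return len(recent) > threshold
```

Six restarts in the last six minutes would set the condition, while exactly five would not, because the rule is "more than 5".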
Some events are not suitable for distributed detection in the NPD DaemonSet, so we detect them centrally in the TKEx Node Controller, which generates Events and sends them to the apiserver. For example:
- Quickly detecting Node network problems without relying on the 5-minute delay of the NodeLost Condition. NPD itself cannot report this problem, because it cannot send events to the apiserver when the Node's network is down.
- Sustained high CPU load on a Node degrades business service quality. The TKEx Node Controller detects this and sends the Top N Pods by CPU load to the apiserver through an Event, and the TKEx-descheduler decides which Pods to evict. The eviction decision must consider whether the Workload that a Pod belongs to has a single replica and whether the Pod can tolerate drift and rebuild.
The TKEx-descheduler ListWatches the events sent by NPD and the TKEx Node Controller and makes the corresponding decisions, such as restarting a problem Container in a Pod in place or evicting a problem Pod.
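The eviction considerations described above (single-replica Workloads and tolerance for rebuild) can be expressed as a small filter. The data model and function here are hypothetical, intended only to make the decision rule concrete; the real TKEx-descheduler is not shown in the article.

```python
from typing import List, NamedTuple


class PodInfo(NamedTuple):
    name: str
    cpu_load: float          # e.g. cores consumed on the hot node
    workload_replicas: int   # replicas of the owning Workload
    tolerates_rebuild: bool  # whether drift and rebuild are acceptable


def pick_evictions(top_pods: List[PodInfo],
                   max_evictions: int = 1) -> List[str]:
    """Choose Pods to evict from the reported Top N, heaviest first.

    Pods of single-replica Workloads, and Pods that cannot tolerate a
    rebuild, are never chosen.
    """
    candidates = [p for p in top_pods
                  if p.workload_replicas > 1 and p.tolerates_rebuild]
    candidates.sort(key=lambda p: p.cpu_load, reverse=True)
    return [p.name for p in candidates[:max_evictions]]
```

Note that the heaviest Pod is skipped when it is its Workload's only replica, so the choice can fall on a lighter but safer Pod.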
Container network enhancement and scheduling optimization
Container network supports EIP
The VPC+ENI underlay network solution that TKEx previously provided puts the container network on the same network plane as the CVM and IDC networks and supports fixed container IPs, which greatly eases the move of self-developed businesses to the cloud. This time, the container network capabilities of the TKEx platform have been upgraded further: Pods using the VPC+ENI container network solution can now be assigned an EIP (elastic public IP).
When the backend cluster resource pool is exhausted, a large number of pending Pods wait to be scheduled. If any type of Workload then updates its image, the pending Pods can preempt the freed resources and cause the upgrade to fail.
To solve this problem and improve the stability of business upgrades, we optimized the Kubernetes Scheduler Cache logic to reserve resources for StatefulSet/StatefulSetPlus upgrades. This ensures that, without adding resources, a StatefulSet/StatefulSetPlus upgrade can complete successfully and its resources are not preempted by the pending Pods in the scheduling queue.
The team will publish a separate technical article analyzing this in detail. Interested readers can follow the Tencent Cloud Native public account and add the assistant TKEplatform to join the Tencent Cloud Container Technology Exchange Group.
Summary
This article summarizes the platform features used in the TKE containerized deployment of Tencent Meeting, including automatic batched grayscale release of business images, batched grayscale release of ConfigMaps, ms-level switching release between A/B containers in a Pod, multi-cluster release management, DynamicQuota-based product quota management, and detection of node and cluster stability problems to improve self-healing. The components and solutions that Tencent's self-developed businesses have built up on TKE will be offered to public cloud customers in the public cloud TKE product, and open-sourcing is also planned, so stay tuned.