Global deployment, elastic scaling, and nearby scheduling: what do you think of the game server engine from "Goose Factory" (Tencent)?
Posted May 25, 2020 • 8 min read
For a game to become a hit and stay one, online play is essential. However, online games demand low latency, stable service, and tight cost control, which creates major challenges for both R&D and operations.
Tencent Game Server Engine (GSE) supports the deployment and scale-out/scale-in of stateful game services, and provides service discovery, efficient elastic server scaling, and nearby scheduling. It helps developers quickly build a stable, low-latency deployment environment for multiplayer games while saving substantial operations cost. The rest of this article explains and analyzes it in detail.
I. Requirements of online battle games
Online battle games include FPS, MOBA, casual .io games, sports, board and card games, strategy, and other genres in which players compete against other people in matches that end within a bounded time.
1. Low latency, so that more players get a smooth experience
Players are spread all over the world. If servers are deployed centrally, players in some regions get a poor network experience and therefore a poor game experience, which is one reason some regions have relatively few players.
Is there a way to reduce latency so that more players can join?
Usually, either nearby scheduling or global acceleration (servers are deployed centrally at one point and traffic from every region is accelerated to that point) can bring network latency close to optimal. For games that are very sensitive to real-time performance, the benefit of nearby scheduling is more pronounced.
However, nearby scheduling comes with a few tricky problems:
Option 1: The service is deployed in multiple regions, and players are matched and play within their nearest region.
Problem: A region with relatively few players may not have enough opponents of a matching skill level. Players end up concentrating in one large region, and the deployment becomes centralized in practice.
Option 2: Matching is done centrally across one large region, and each match is then assigned to a nearby region when the battle starts.
Problem: It is impossible to know in advance which regions' players will be matched together, and many players also invite friends to play with them. The number of players assigned to each region can vary widely from day to day, so the number of servers each region needs cannot be estimated accurately ahead of time. Provision too few and capacity runs out; provision too many and resources are wasted. To guarantee nearby scheduling in real time, each region may have to be provisioned for its maximum possible load, which sharply increases server cost.
As a result, the plan actually adopted often falls back to centralized deployment. For example, within China, all services are deployed in Shanghai.
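To put rough numbers on the Option 2 problem, here is a small sketch. The regions and daily demand figures are made up purely for illustration; they are not measurements from any real game. It compares provisioning every region for its own worst day against the largest total demand actually seen on any single day.

```python
# Hypothetical number of servers each region needs on three different days.
daily_demand = [
    {"Beijing": 40, "Shanghai": 10, "Guangzhou": 10},
    {"Beijing": 10, "Shanghai": 40, "Guangzhou": 10},
    {"Beijing": 10, "Shanghai": 10, "Guangzhou": 40},
]


def static_provisioning(days):
    """Servers needed if every region is sized for its own worst day."""
    regions = days[0].keys()
    return sum(max(day[region] for day in days) for region in regions)


def peak_total_demand(days):
    """Largest total demand across all regions on any single day."""
    return max(sum(day.values()) for day in days)


print(static_provisioning(daily_demand))  # 120
print(peak_total_demand(daily_demand))    # 60
```

With these invented figures, sizing each region for its own peak takes twice as many servers as the busiest day ever uses in total, which is the waste described above.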
2. Stable service, to protect the player experience and generate more revenue
Keeping the service stable and the player experience intact brings the following challenges:
(1) Explosive growth, with no way to expand capacity in time to take in more players
Handling explosive growth requires a lot of advance work from both R&D and operations: the service must be able to scale horizontally, so that adding servers lets the game support more players without a fixed upper limit.
This is a stateful scaling scenario. For game services, and battle services in particular, simply adding a CLB (load balancer) is not enough. A player who disconnects and reconnects must be routed back to the server they were previously connected to, and a game in progress must not be interrupted by a scale-in.
R&D side: Service registration, service discovery, service scheduling, service management, and so on are needed so that services can scale horizontally and automatically; otherwise everything has to be configured by hand. To ensure stability, the system must also check service health, take unhealthy services out of rotation, and protect running services so that games in progress are not interrupted.
Operations side: Scripts are needed to add servers, and tools are needed to make the servers scale automatically. Automation requires developing scaling policies, automatic server purchase and deployment, and tooling to detect and remove faulty servers.
Even with all of the above in place, abnormal situations can still occur:
Server allocation is usually driven by each server's CPU and memory metrics. This can cause a large number of services that currently use little CPU and memory to be packed onto the same server within a short period; when traffic to them then rises suddenly, the server is overloaded and hangs. To avoid this, CPU utilization is usually kept at a low level.
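The failure mode above can be shown with a toy allocator. This is only a sketch with invented numbers: `Server`, `pick_by_cpu`, `pick_with_budget`, and the 5% per-session budget are all hypothetical. The second picker is not the remedy the article mentions (keeping CPU utilization low); it is one possible alternative that charges each newly placed session its expected load before that load shows up in the metrics.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    measured_cpu_pct: int   # CPU utilization currently reported by monitoring
    newly_placed: int = 0   # sessions placed recently, still idle, not yet visible in CPU


def pick_by_cpu(servers):
    """Naive policy: always pick the server with the lowest measured CPU."""
    return min(servers, key=lambda s: s.measured_cpu_pct)


def pick_with_budget(servers, expected_pct_per_session=5):
    """Alternative policy: add an assumed CPU budget for every newly placed session."""
    return min(
        servers,
        key=lambda s: s.measured_cpu_pct + s.newly_placed * expected_pct_per_session,
    )


def place(servers, n_sessions, picker):
    """Place a burst of sessions and return how many landed on each server."""
    for s in servers:
        s.newly_placed = 0
    counts = {s.name: 0 for s in servers}
    for _ in range(n_sessions):
        chosen = picker(servers)
        chosen.newly_placed += 1
        counts[chosen.name] += 1
    return counts


servers = [Server("a", 10), Server("b", 20)]
print(place(servers, 10, pick_by_cpu))       # {'a': 10, 'b': 0}
print(place(servers, 10, pick_with_budget))  # {'a': 6, 'b': 4}
```

Because the measured CPU does not move while the new sessions are still idle, the naive policy sends the whole burst to one server.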
None of this is simple, for R&D or for operations. The workload is considerable, and early on nobody knows whether the game will take off, so small and medium-sized developers generally do not make these preparations in advance.
(2) Region or server failure
Server failures are relatively common. The usual practice is to monitor the servers and remove a failed one from service immediately.
Failure of a whole region or data center is uncommon, but its impact is very wide. Game developers generally do not plan for it, because cross-region or cross-data-center disaster recovery requires at least twice as many servers, so the return on investment is relatively low.
3. Cost savings
The cost of idle servers comes mainly from:
- Idle resources in the troughs between daily, weekend, and holiday peaks
- Idle servers once the game enters its stable-operation and decline phases
- Resources left idle after the explosive growth during an event has passed
Server cost is small compared with the overall cost of operating a game, but every saving helps, right?
II. Challenges for R&D and operations
As described above, improving the game experience brings a heavy R&D workload, a heavy operations workload, and high server cost.
1. Heavy R&D workload
Service management, nearby scheduling, cross-region disaster recovery, updates without downtime, and automatic scaling all take a huge amount of development effort. Many large companies have built up this supporting tooling gradually over time. Innovative studios and startups would rather focus on building the game itself, so this work is a burden for them.
2. Heavy operations workload
Scaling out, scaling in, and releasing versions happen over and over. Avoiding that repetitive manual work means developing tools and scripts, which requires a large upfront investment.
3. High server cost
One part is the cost of idle resources. The other is that, done the traditional way, nearby scheduling and cross-region disaster recovery at least double the server cost.
III. The Game Server Engine solution
Tencent Game Server Engine (GSE) is a hosting service for dedicated game servers. It supports the deployment and scaling of stateful game services and provides service discovery, efficient elastic server scaling, and nearby scheduling, helping developers quickly build a stable, low-latency deployment environment for multiplayer games and save substantial operations cost.
It supports deploying and running servers built with the Unity engine, the Unreal engine, and custom game frameworks. It is used for FPS, MOBA, turn-based, and MMORPG games, for battle servers in board and card games, and for message push.
1. Elastic scaling
A game has peaks and troughs every day, and the player curve also shifts over the year with holidays, weekends, and so on. This places high demands on the ability to scale and schedule servers. Elastic scaling and resource scheduling are the core capabilities of GSE.
[Figure: daily fluctuation curve of a game]
[Figure: annual fluctuation curve of a game]
(1) GSE can scale servers in real time
In GSE you set the server instance type and a scaling range, and the number of instances scales within that range. Game traffic has daily peaks and troughs: the number of server instances typically peaks around noon and in the evening and drops to its minimum after midnight. GSE scales automatically based on the load on the servers at each moment of the day.
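As a rough illustration of scaling within a configured range, the sketch below computes a target instance count from the current load and clamps it to the range. The function name, its parameters, and the 20% headroom are assumptions made for this example; the article does not describe the actual formula GSE uses.

```python
def desired_instances(active_sessions, sessions_per_instance,
                      min_instances, max_instances, headroom_pct=20):
    """Target instance count for the current load, clamped to the scaling range.

    headroom_pct is a hypothetical buffer so new sessions do not have to wait
    for a scale-out. Integer arithmetic avoids floating-point rounding issues.
    """
    numerator = active_sessions * (100 + headroom_pct)
    denominator = 100 * sessions_per_instance
    needed = -(-numerator // denominator)  # ceiling division
    return max(min_instances, min(max_instances, needed))


# Hypothetical fleet: 50 sessions per instance, scaling range 1 to 100.
print(desired_instances(1000, 50, 1, 100))    # 24
print(desired_instances(0, 50, 1, 100))       # 1
print(desired_instances(100000, 50, 1, 100))  # 100
```

After midnight the result falls back to the lower bound of the range, and during a spike it never exceeds the upper bound, which also caps the bill.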
(2) GSE can scale in stateful servers
GSE does not reclaim an instance while game processes on it are still in use. When low load triggers a scale-in, GSE notifies the game process that the server is being scaled in and stops allocating new game server sessions to that server, but it does not force the instance to be reclaimed mid-game. It waits until no players remain in the game process; only after the process issues the end command are the process stop and the server reclamation actually triggered.
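The drain behavior just described can be modeled as a small state machine. This is a minimal sketch of the idea, not the GSE SDK: the class, method, and state names are hypothetical.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"          # accepts new game server sessions
    DRAINING = "draining"      # scale-in requested, existing sessions keep running
    TERMINATED = "terminated"  # process stopped, server can be reclaimed


class GameProcess:
    def __init__(self, name):
        self.name = name
        self.state = State.ACTIVE
        self.sessions = set()

    def can_accept(self):
        return self.state is State.ACTIVE

    def start_session(self, session_id):
        if not self.can_accept():
            raise RuntimeError(f"{self.name} is not accepting new sessions")
        self.sessions.add(session_id)

    def request_scale_in(self):
        """Scale-in notice: block new sessions, but never kill running ones."""
        if self.state is State.ACTIVE:
            self.state = State.DRAINING
        self._maybe_terminate()

    def end_session(self, session_id):
        self.sessions.discard(session_id)
        self._maybe_terminate()

    def _maybe_terminate(self):
        # Only a draining process with no remaining sessions is actually stopped.
        if self.state is State.DRAINING and not self.sessions:
            self.state = State.TERMINATED


process = GameProcess("battle-1")
process.start_session("match-42")
process.request_scale_in()
print(process.state)  # State.DRAINING
process.end_session("match-42")
print(process.state)  # State.TERMINATED
```

The key design point is that the scale-in request only changes which new work is accepted; termination is driven by the last session ending.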
The benefits of elastic scaling are:
a. Increase flexibility
- Nearby scheduling: acquire servers when they are needed and return them when they are not.
- Disaster recovery works the same way: acquire servers when they are needed and return them when they are not.
b. Cost savings
- Reduce the daily, weekly, and annual cost of idle resources, an estimated saving of 20%-30%
- Reduce the cost of dispatching nearby
- Reduce the cost of disaster recovery
2. Nearby scheduling
Elastic scaling is the foundation, and GSE's resource scheduling builds on it. Resources in any Tencent Cloud region can be scheduled at any time, so there is no need to reserve servers in each region in advance, which makes nearby scheduling simple.
GSE provides client-to-server speed measurement, which yields the latency from a client to every region where the service is deployed. GSE uses this latency to schedule players to a nearby region.
As shown in the figure below, a group of matched players is assigned to the nearest server for their battle. Beijing, Shanghai, Guangzhou, and Chengdu can each start with a single server and a scaling policy, so that each region scales automatically when needed.
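Here is one way the region choice for a matched group could work, given per-player latency measurements. The article only says that GSE schedules by latency; picking the region with the lowest worst-case latency across the group is an assumption made for this sketch, and the player names and latency values are invented.

```python
def pick_region(latencies_ms):
    """Pick the region whose worst player latency is lowest.

    latencies_ms maps player -> {region: measured latency in ms}.
    Only regions measured by every player are considered.
    """
    if not latencies_ms:
        raise ValueError("no players in the match")
    per_player = list(latencies_ms.values())
    regions = set(per_player[0])
    for measurements in per_player[1:]:
        regions &= set(measurements)
    if not regions:
        raise ValueError("no region was measured by every player")
    # sorted() makes ties deterministic (alphabetical order).
    return min(sorted(regions), key=lambda r: max(m[r] for m in per_player))


# Hypothetical speed-test results for one matched group.
match = {
    "alice": {"Beijing": 12, "Shanghai": 35, "Guangzhou": 60},
    "bob":   {"Beijing": 48, "Shanghai": 20, "Guangzhou": 45},
    "carol": {"Beijing": 30, "Shanghai": 28, "Guangzhou": 70},
}
print(pick_region(match))  # Shanghai
```

Minimizing the worst case keeps the match fair for the player with the slowest connection; minimizing the average would be an equally reasonable policy.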
3. Multi-region deployment and cross-region disaster recovery
Elastic scaling is again the foundation. GSE can schedule resources of any instance type in any Tencent Cloud region at any time, so disaster recovery is easy to achieve.
A game server queue contains game server fleets (a fleet is a group of servers) in various regions. The business only needs to send requests to the game server queue. Based on the health of each fleet and the network latency from the client to the servers, the queue automatically excludes a problematic region and chooses a healthy server to provide service. If demand in the healthy regions rises, they are scaled out automatically. There is no need to deploy the same number of servers in several regions in advance, which achieves disaster recovery at effectively zero extra cost.
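The queue's selection logic can be sketched as a filter followed by a latency sort. `Fleet`, `place_session`, and the scale-out flag are hypothetical names for this example, and the regions and numbers are made up; the real queue API is not shown in the article.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    region: str
    healthy: bool
    free_slots: int  # game server sessions the fleet can still accept


def place_session(fleets, latency_ms):
    """Return (region, needs_scale_out) for a new game server session.

    Unhealthy fleets are skipped, then the lowest-latency fleet wins.
    If that fleet is full, the caller is told a scale-out is needed.
    """
    candidates = [f for f in fleets if f.healthy and f.region in latency_ms]
    if not candidates:
        raise RuntimeError("no healthy fleet available")
    chosen = min(candidates, key=lambda f: latency_ms[f.region])
    needs_scale_out = chosen.free_slots == 0
    if not needs_scale_out:
        chosen.free_slots -= 1
    return chosen.region, needs_scale_out


# Hypothetical outage in Shanghai: traffic fails over to the next-nearest region.
fleets = [
    Fleet("Shanghai", healthy=False, free_slots=5),
    Fleet("Guangzhou", healthy=True, free_slots=0),
    Fleet("Beijing", healthy=True, free_slots=3),
]
latency = {"Shanghai": 10, "Guangzhou": 25, "Beijing": 40}
print(place_session(fleets, latency))  # ('Guangzhou', True)
```

In this example the surviving region has no spare capacity, so the failover relies on elastic scale-out rather than on servers kept idle in advance.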
4. How GSE differs from ordinary elastic scaling
GSE focuses on stateful scaling scenarios.
Games usually have two special requirements: players must be able to reconnect after a disconnection, and players must not be kicked out mid-game. Game servers are generally stateful, so how can they be scaled in?
GSE is designed with three protection policies for game servers:
(1) Full protection: if a game process is still running, the instance is not scaled in.
(2) No protection: when a scale-in is required, the instance is scaled in immediately.
(3) Time-limited protection: the instance is protected for a set period, such as 1 hour.
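The three policies boil down to a single yes/no decision at scale-in time. The sketch below is one reading of them; the names are hypothetical, and measuring the time limit from the start of the running session is an assumption, since the article does not say when the clock starts.

```python
from enum import Enum


class Protection(Enum):
    FULL = "full"
    NONE = "none"
    TIME_LIMITED = "time_limited"


def may_scale_in(policy, has_running_sessions, session_age_s, limit_s=3600):
    """Decide whether an instance may be reclaimed right now.

    session_age_s: seconds since the oldest running session started (assumed clock).
    limit_s: protection window for TIME_LIMITED, 1 hour by default.
    """
    if not has_running_sessions:
        return True            # nothing to protect
    if policy is Protection.NONE:
        return True            # reclaim immediately, even mid-game
    if policy is Protection.FULL:
        return False           # never interrupt a running game
    return session_age_s >= limit_s  # TIME_LIMITED: protected until the window expires


print(may_scale_in(Protection.FULL, True, 7200))          # False
print(may_scale_in(Protection.NONE, True, 10))            # True
print(may_scale_in(Protection.TIME_LIMITED, True, 600))   # False
print(may_scale_in(Protection.TIME_LIMITED, True, 3600))  # True
```

Time-limited protection is a middle ground: it guards normal matches while preventing a stuck process from holding a server forever.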
5. Updating without stopping the service
With its strong resource scheduling, GSE makes updates without downtime easy to achieve.
The client requests servers in FleetA through an alias. To release a new version, create a new fleet, FleetB, publish the new version to FleetB, and point the alias at FleetB. The client keeps calling the same alias but now reaches the version running on FleetB. FleetB gradually scales out, and FleetA gradually scales in.
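The alias indirection is essentially a lookup table that clients resolve on every request. The sketch below uses a hypothetical `AliasRouter` class and an invented alias name to show the switch; it is not the GSE API.

```python
class AliasRouter:
    """Maps a stable alias to whichever fleet currently serves new sessions."""

    def __init__(self):
        self._targets = {}

    def point(self, alias, fleet_id):
        self._targets[alias] = fleet_id

    def resolve(self, alias):
        return self._targets[alias]


router = AliasRouter()
router.point("battle-prod", "FleetA")        # current version
print(router.resolve("battle-prod"))         # FleetA

# Release: the new build is already deployed to FleetB, so just repoint the alias.
router.point("battle-prod", "FleetB")
print(router.resolve("battle-prod"))         # FleetB
```

Clients never learn fleet IDs, so no client change is needed for a release. Sessions already running on FleetA are untouched by the switch and simply finish there while FleetA scales in.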
That concludes this brief introduction to the design of Game Server Engine. The capabilities can also be used in separate combinations: elastic scaling alone, elastic scaling plus nearby scheduling, or elastic scaling plus disaster recovery. The product does not intrude on the game framework or game logic code. It supports the Unity engine, the Unreal engine, custom server frameworks, and open-source frameworks, and it supports C++ and C# as well as Java, PHP, Python, Lua, Node.js, and other languages that support gRPC.
If your game is at the project-approval or technology-selection stage, would you choose this brand-new game deployment framework?