Learn distributed systems in one go: CAP theory, distributed transactions, distributed locks, distributed IDs, service rate limiting, and more. Packed with practical content!

Posted May 25, 2020 · 14 min read

This article collects the basic theory of distributed systems and walks through the common application scenarios: distributed locks, distributed transactions, distributed primary-key IDs (Meituan Leaf), rate-limiting algorithms, and the consistent hash algorithm. For each part, the author has collected and analyzed the schemes in detail, including the advantages and disadvantages of each. The author hopes this article can serve as a comprehensive summary and give readers a systematic explanation. If readers find omissions or errors in the article, please leave a message in the comment area, and the author will improve the content as soon as possible. Thank you!

CAP theory:

  • C (Consistency): if data is updated on one node and the latest data can then be read from every other node, the system is strongly consistent; if some node still reads the stale data, the distributed system is inconsistent. In short: all nodes see the same data at the same time.
  • A (Availability): every non-faulty node returns a reasonable response within a reasonable time; reads and writes always succeed.
  • P (Partition tolerance): the distributed system keeps working even when messages are lost or some nodes fail (i.e., the network partitions).

CAP theory states that a distributed system can satisfy at most two of consistency, availability, and partition tolerance. Since partition tolerance is unavoidable, most distributed software systems make a trade-off between CP and AP:

  • Zookeeper chooses CP: it emphasizes consistency and weakens availability.
  • Eureka chooses AP: it emphasizes availability and weakens consistency.

BASE theory

BASE stands for Basically Available, Soft state, Eventually consistent. Since strong consistency cannot be achieved, each application can reach eventual consistency in a way suited to its own business characteristics. BASE theory is a practical application of CAP theory.

  • Basically available: the system does not pursue full availability but guarantees basic operation and external service. When the distributed system hits unpredictable failures, a certain degree of unavailability is allowed, e.g., queuing requests so that some users see longer response times, or degrading non-core services.
  • Soft state: database transactions are atomic (either everything succeeds or nothing does); soft state relaxes this and allows data in the system to pass through intermediate states.
  • Eventual consistency: data cannot stay in a soft state forever; after some period the nodes must converge, and once all nodes hold the same data the system has reached eventual consistency.

Distributed consensus algorithms

WARO: Write All, Read One

A simple replica-control protocol: when a client sends a write request to the distributed application, the write is considered successful only after every replica node has been updated; otherwise it fails. This lowers the availability of writes (a single failed replica fails the whole write) but raises the availability of reads (any one replica can serve the latest data).

Quorum: eventual consistency

Assume there are N replicas. When a client sends a write request to the distributed application, the write is considered successful once W replicas have been updated. A read must then cover at least N - W + 1 replicas to be guaranteed to see the latest result (that is, choose a read quorum R with R + W > N).
Quorum alone cannot guarantee strong consistency; it is a voting mechanism commonly used in distributed systems to reach eventual consistency over redundant copies of data. Kafka's ISR mechanism is somewhat similar to it.

Paxos algorithm:distributed consensus algorithm

The Paxos protocol has three types of role nodes:

  • Proposer
    There can be multiple proposers. At the start of a round, a proposer proposes an operation, called a value (for example, modifying a variable), and different proposers may propose different operations. After one round of the Paxos algorithm, however, only one proposal's operation is chosen and executed.
  • Acceptor
    The cluster has multiple acceptors (say N), completely independent and equal. An operation proposed by a proposer must be accepted by more than half (N/2 + 1) of the acceptors before it can be executed.
  • Learner
    A learner does not take part in the voting; it simply learns and applies the operations the acceptors have accepted.

Distributed transactions:

Common distributed-transaction solutions include the two-phase commit protocol, the three-phase commit protocol, TCC, and message-queue-based eventual consistency.

Two-phase commit protocol (2PC)

  • In two-phase commit, one node acts as the coordinator and the other nodes act as participants.
  • All nodes use write-ahead logging; log records are never lost.
  • No node stays down permanently; every node can recover after a crash.
  1. Phase 1 (prepare): the coordinator asks each database involved in the transaction to pre-commit the operation and report whether it can commit.
  2. Phase 2 (commit/rollback): based on the phase-1 feedback, the coordinator tells each database either to commit the data or to roll it back.
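
To make the control flow concrete, here is a minimal coordinator sketch in Python; the `Participant` interface is a hypothetical illustration (a real coordinator also persists its decision to a log before phase 2):

```python
# A minimal 2PC coordinator sketch. Participant is a hypothetical
# interface; a real coordinator also logs its decision before phase 2.
class Participant:
    def prepare(self) -> bool: ...    # write WAL, lock resources, vote
    def commit(self) -> None: ...
    def rollback(self) -> None: ...

def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1: ask every participant to pre-commit and report its vote.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())
        except Exception:
            votes.append(False)       # an unreachable node counts as "no"
    # Phase 2: commit only if every vote was yes; otherwise roll back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```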

Disadvantages:

  1. The coordinator is a single point of failure; once it goes down, the whole database cluster is unusable.
  2. During execution, every node taking part in the transaction holds its resources exclusively; while some participants occupy shared resources, any third party's access to those resources is blocked.
  3. In the second phase, if the coordinator's commit notification is received and executed by only some of the participants, the remaining participants stay blocked waiting for a notification that never arrives, leaving the data inconsistent.

Three-phase commit protocol (3PC)

To solve the resource-blocking problem of the two-phase commit protocol, three-phase commit gives both the coordinator and the participants a timeout mechanism, and splits the first phase of 2PC in two: first ask (CanCommit), then lock the resources (PreCommit), and finally commit (DoCommit).

  1. CanCommit: the coordinator sends a CanCommit request to the participants, and each participating node replies whether it can commit.

  2. PreCommit:According to the CanCommit response, there are the following two implementations.

    • If all participating nodes return Yes, the transaction is pre-executed: the coordinator sends a PreCommit request, the participants enter the prepared state and feed an ACK back to the coordinator.
    • If any node returns No, or the coordinator times out waiting, the transaction is interrupted: the coordinator sends an abort request to all participants, which then abandon the transaction.
  3. DoCommit:There are also two implementations based on the PreCommit response.

    • If the coordinator receives an ACK from every participant, it sends a DoCommit request; all participants commit the transaction, release their resources, and feed an ACK back to the coordinator.
    • If the coordinator does not receive an ACK from every participant, it interrupts (aborts) the transaction.

Disadvantages: in the DoCommit phase, if the coordinator's commit notification is received and executed by only some participants while the rest never receive it, the participants end up in different states and the data becomes inconsistent.

TCC

TCC is a distributed-transaction solution proposed by Alipay. Every participant in the distributed transaction must implement three interfaces: Try, Confirm, and Cancel.

  1. Try phase: the caller invokes the Try interface of each service; each service checks whether it can complete the work and, if so, reserves (locks) the resources it needs.
  2. Confirm phase: if every service's Try interface returns yes, the caller enters the commit phase and invokes the Confirm interface of each service, and each service consumes the resources locked in the Try phase. If any Try returned no or timed out, the caller invokes the Cancel interface of each service to release the locked resources.
  3. Cancel phase: abandons execution and releases the business resources reserved in the Try phase.
  • Confirm and Cancel are mutually exclusive (only one of them may run), and both are idempotent, so failures can be retried.
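
A minimal sketch of the three interfaces, using a hypothetical account service where Try freezes funds, Confirm deducts them, and Cancel unfreezes them (names and amounts are illustrative, not from any real TCC framework):

```python
# Hypothetical TCC participant for an account service: Try freezes
# funds, Confirm really deducts them, Cancel releases the freeze.
class AccountTcc:
    def __init__(self, balance: int = 100):
        self.balance, self.frozen = balance, 0

    def try_(self, amount: int) -> bool:   # resource check + lock
        if self.balance - self.frozen < amount:
            return False
        self.frozen += amount
        return True

    def confirm(self, amount: int):        # consume the reserved amount
        self.balance -= amount
        self.frozen -= amount

    def cancel(self, amount: int):         # release the reserved amount
        self.frozen -= amount

def transfer(services: list, amount: int):
    votes = [s.try_(amount) for s in services]
    if all(votes):
        for s in services:
            s.confirm(amount)              # must be idempotent
    else:
        for s in services:
            s.cancel(amount)               # must be idempotent, and must
                                           # tolerate an "empty rollback"
```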

Advantages

1. TCC solves the atomicity of cross-service business operations, lets the application define the granularity of its database operations itself, reduces lock conflicts, and improves business throughput.
2. Each stage of TCC is controlled by the business itself, avoiding long transactions and improving performance.

Disadvantages

1. Strong business intrusion: the business logic itself must implement the Try, Confirm, and Cancel operations.

Abnormal cases

  • Empty rollback
    Phenomenon: Try never executed, yet Cancel is called. An exception occurs when invoking Try, so the Try interface is never actually reached and naturally never returns yes; the flow then proceeds to the second stage as normal and calls the Cancel interface, producing an empty rollback.
    Solution: let Cancel recognize an empty rollback: record the transaction's execution status, and check inside Cancel whether Try ever ran.
  • Repeated call
    When the commit phase fails, Confirm and Cancel may be called repeatedly, so both must be idempotent: executing them multiple times must have the same effect as executing them once.

Solution: record the transaction execution status; if the step has already run, do not run it again.

Idempotency of an interface means the caller gets a consistent result no matter how many times it calls. A query interface is naturally idempotent.

  • Suspension
    Phenomenon: Cancel executes first and Try executes afterwards, so nobody ever releases the resources. A Try call times out because of network congestion and is treated as a failure, so Cancel is called and the transaction is effectively over. When the network recovers, the delayed Try finally executes and locks the related resources; but since the transaction has already ended, Confirm and Cancel will never be called again, so the resources hang and can never be released.
    Solution: again record the transaction execution status, and check inside Try whether Cancel has already run.
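
All three anomalies reduce to the same guard: a status record per global transaction id. A minimal in-memory sketch (real systems keep this table in a database; the status names are assumptions):

```python
# One status record per transaction id guards against empty rollback,
# repeated calls, and suspension. A dict stands in for a DB table.
status = {}   # xid -> "TRIED" | "CONFIRMED" | "CANCELLED"

def on_try(xid) -> bool:
    if status.get(xid) == "CANCELLED":   # suspension: cancel already ran,
        return False                     # so refuse the late try
    status[xid] = "TRIED"
    return True

def on_confirm(xid):
    if status.get(xid) == "CONFIRMED":   # repeated call: idempotent no-op
        return
    status[xid] = "CONFIRMED"

def on_cancel(xid):
    if xid not in status:                # empty rollback: try never ran;
        status[xid] = "CANCELLED"        # record it and do nothing else
        return
    if status[xid] == "CANCELLED":       # repeated call: idempotent no-op
        return
    status[xid] = "CANCELLED"
```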

MySQL XA transactions

XA transactions are based on the two-phase commit protocol. The XA specification mainly defines the interface between the transaction coordinator and the resource managers.

  1. Transaction coordinator: ensures that all transaction participants finish their preparation; once it hears that every participant is ready, it notifies all of them to commit.
  2. Resource manager:responsible for controlling and managing actual resources.

The XA transaction execution flow is similar to the two-phase commit protocol.

  1. Prepare phase: the transaction coordinator sends a prepare command to all resource managers; on receiving the command, each performs its data operations and log records, then reports its result.
  2. Commit phase: the transaction coordinator collects the results of all resource managers and issues either a Rollback or a Commit command, completing the transaction.

MySQL XA transactions come in two flavors, internal XA and external XA. A transaction confined to a single MySQL server uses internal XA; a transaction spanning multiple external nodes uses external XA.
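
For external XA, MySQL exposes the XA statements directly. A sketch of one resource-manager branch driven from Python (connection details are placeholders; `mysql.connector` is just one driver that can issue these statements):

```python
import mysql.connector

# One resource-manager branch of an external XA transaction.
conn = mysql.connector.connect(host="db1", user="app",
                               password="...", database="orders")
cur = conn.cursor()
xid = "'gtrid-42'"                      # global transaction id

cur.execute(f"XA START {xid}")
cur.execute("UPDATE account SET balance = balance - 10 WHERE id = 1")
cur.execute(f"XA END {xid}")
cur.execute(f"XA PREPARE {xid}")        # phase 1: this branch votes yes

# ... the coordinator collects PREPARE results from every branch,
# then drives phase 2 on each of them:
cur.execute(f"XA COMMIT {xid}")         # or: XA ROLLBACK 'gtrid-42'
```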

Internal XA: MySQL maintains both the binlog and the InnoDB redo log. To keep the two logs consistent, MySQL uses an XA transaction internally. When a transaction commits:

1. Step 1: InnoDB enters the Prepare phase and writes the transaction's XID to the redo log; the binlog does nothing.
2. Step 2: the binlog is written, and the same XID is written into the binlog.
3. Step 3: InnoDB's commit is invoked to complete the transaction, and the commit record is written to the redo log.

Distributed lock implementation scheme summary

Implementation based on a database primary key

Acquire lock: to lock a resource, insert a corresponding record for it into a table.
Release lock: delete the corresponding record.
A database-backed distributed lock is simple to implement (see the sketch after the list below), but it has many problems.

  1. Single point of failure: once the database goes down, the entire business system is unavailable. Configure master-slave replication to avoid the single point.
  2. Timeout problem: if the unlock operation fails, the lock row stays in the database and no other thread can acquire the lock. A separate scheduled task can delete expired records based on timestamps.
  3. Not reentrant: the same thread cannot acquire the lock again before releasing it. Reentrancy requires changing the locking method to store and check the owning thread's identity.
  4. Blocking/waiting problem: when another thread requests the resource, its insert simply fails and returns immediately instead of blocking, so the caller must retry the insert in a loop, which wastes database resources.
  5. Master-slave problem: under high concurrency, master-slave replication lag grows; a thread may read stale data and acquire the lock twice.
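
A minimal sketch of the primary-key scheme (table and column names are assumptions):

```python
# Table (created once):
#   CREATE TABLE dist_lock (
#     resource   VARCHAR(64) PRIMARY KEY,   -- the PK enforces exclusion
#     owner      VARCHAR(64),
#     created_at DATETIME
#   );

def acquire(cur, resource: str, owner: str) -> bool:
    try:
        cur.execute(
            "INSERT INTO dist_lock (resource, owner, created_at) "
            "VALUES (%s, %s, NOW())", (resource, owner))
        return True                # insert succeeded: lock acquired
    except Exception:              # duplicate key: someone holds the lock
        return False

def release(cur, resource: str, owner: str) -> None:
    cur.execute("DELETE FROM dist_lock WHERE resource = %s AND owner = %s",
                (resource, owner))
```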

Implementation scheme based on ZooKeeper

Distributed locks can be built on the ephemeral and ephemeral-sequential nodes that Zookeeper supports.

Exclusive lock: implemented with ephemeral nodes

Acquire lock: to lock a resource, a client asks Zookeeper to create a unique ephemeral node under the directory node for that resource; other clients set a Watcher on this node.
Release lock: Zookeeper deletes the ephemeral node (explicitly, or automatically when the owner's session ends); the other clients are notified of the deletion and compete for the lock again.
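
A sketch of the ephemeral-node exclusive lock using the kazoo client (the path and hosts are assumptions; in production, kazoo's built-in Lock recipe is preferable):

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

LOCK = "/locks/my-resource"

def try_acquire() -> bool:
    try:
        # Ephemeral: the node disappears with our session, so a crashed
        # holder cannot block everyone else forever.
        zk.create(LOCK, b"owner-1", ephemeral=True, makepath=True)
        return True
    except NodeExistsError:
        # Held by someone else: watch the delete event, then re-compete.
        zk.exists(LOCK, watch=lambda event: try_acquire())
        return False

def release() -> None:
    zk.delete(LOCK)
```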

Read-write lock: implemented with ephemeral sequential nodes

Acquire a read lock:

  1. Create an ephemeral sequential node and mark it as a read lock.
  2. Get all the child nodes under the resource path, sorted from small to large.
  3. Find the nearest write-lock node preceding the current node.
  4. If there is no preceding write-lock node, the read lock is acquired successfully.
  5. If there is a preceding write-lock node, set a Watcher on that node's delete event.
  6. Once the delete event fires, repeat steps 2 through 5.

Acquire Write Lock

  1. Create a temporary ordered node and mark it as a write lock.
  2. Get all the child nodes under the path and sort them from small to large.
  3. Find the nearest node (read lock or write lock) preceding the current node.
  4. If there is no preceding node, the write lock is acquired successfully.
  5. If there is a preceding node, set a Watcher on its delete event.
  6. Once the delete event fires, repeat steps 2 through 5 (recursively).

Release lock
Delete the corresponding ephemeral node.

Implementation scheme based on Redis

Principle: acquiring the lock means creating a key for it in Redis. Before acquiring, check whether the key already exists; if it does, another client already holds the lock.

Improvement 1: to guard against the lock-holding client suddenly crashing, specify an expiration time when setting the key, so the lock is eventually released even after a crash. Use the SETNX command to set the key's value and the EXPIRE command to set the expiration time.

Improvement 2: SETNX followed by EXPIRE is not atomic; if the client dies between the two commands, the key never expires and no other client can ever acquire the lock. Redis's SET command accepts options (NX, EX/PX) that set the value and the expiration time in one atomic command.

Improvement 3: a client may acquire the lock, fail to finish its task before the lock expires, and then release a lock that another client has since acquired, invalidating that client's lock. Set the key to a random value r; before deleting, first compare the value stored in redis with r and delete only if they match.

Improvement 4: compare-then-delete is itself not atomic; between the comparison and the delete, the key may expire and be acquired by another client, whose lock would then be deleted. Redis provides no built-in atomic compare-and-delete, so release the lock with a lua script: redis executes a lua script as one atomic operation.
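
Putting improvements 2 through 4 together, a sketch with redis-py (the key name and TTL are illustrative):

```python
import uuid
import redis

r = redis.Redis()

# Lua: delete the key only if it still holds our random value; Redis
# runs the whole script atomically (improvement 4).
RELEASE = r.register_script(
    'if redis.call("get", KEYS[1]) == ARGV[1] then '
    '  return redis.call("del", KEYS[1]) '
    'else return 0 end')

def acquire(key: str, ttl_ms: int = 30_000):
    token = str(uuid.uuid4())            # random value r (improvement 3)
    # SET ... NX PX: set-if-absent plus TTL in one atomic command
    # (improvement 2, which also covers improvement 1's crash case).
    ok = r.set(key, token, nx=True, px=ttl_ms)
    return token if ok else None

def release(key: str, token: str) -> bool:
    return RELEASE(keys=[key], args=[token]) == 1
```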

Distributed unique ID generation schemes

UUID:

A UUID combines the network card's MAC address, a nanosecond timestamp, a chip ID code, and random numbers to generate a unique 128-bit value, usually written as 32 hexadecimal characters.
Advantages:Good performance, locally generated, globally unique.
Disadvantages:

  1. The UUID is 32 characters long. In MySQL, every secondary index contains the primary key, so an overlong primary key hurts MySQL storage and performance.
  2. UUIDs are unordered; every UUID insert heavily modifies the clustered primary-key B+ tree (page splits).
  3. The information is not secure: a (version-1) UUID embeds the MAC address and chip ID code, which can leak information.

Database auto-increment IDs

With multiple databases, globally unique auto-increment IDs can be produced by giving each database a different starting value and a common increment step. Taking 4 databases as an example:

| Database number | Starting value | Increment step | Generated key sequence |
| --- | --- | --- | --- |
| 1 | 1 | 4 | 1, 5, 9, 13, 17, 21, 25, 29, ... |
| 2 | 2 | 4 | 2, 6, 10, 14, 18, 22, 26, 30, ... |
| 3 | 3 | 4 | 3, 7, 11, 15, 19, 23, 27, 31, ... |
| 4 | 4 | 4 | 4, 8, 12, 16, 20, 24, 28, 32, ... |

Advantages: simple, and the IDs can be generated and stored directly in the database.
Disadvantages:

  1. Horizontal scaling is hard: once the step size and the number of machines are fixed, adding a database requires readjusting the starting values and step sizes of all databases.
  2. The database is under pressure: every ID acquisition writes to the database once.
  3. The information is insecure: IDs are plainly sequential, so a competitor can estimate order volume from the difference between two IDs.

Snowflake

A snowflake-generated ID is a 64-bit integer consisting of 1 sign bit, a 41-bit timestamp, a 10-bit machine ID (identifying up to 1024 machines), and a 12-bit self-incrementing sequence. The structure is as follows:

[Figure: Snowflake 64-bit ID bit layout]

Advantages: IDs trend upward, there is no dependence on third-party components (databases, etc.), performance is high, and the bit allocation can be tuned to the business.
Disadvantages: strong dependence on the machine clock. If the clock is rolled back, the IDs generated by the whole system become unusable.
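
A minimal sketch of the standard layout (the custom epoch and the worker-id source are arbitrary choices here):

```python
import time, threading

class Snowflake:
    """1 sign bit | 41-bit ms timestamp | 10-bit worker id | 12-bit seq."""
    EPOCH = 1288834974657                # a custom epoch, in milliseconds

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024     # 10 bits
        self.worker_id = worker_id
        self.last_ms, self.seq = -1, 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF    # 12-bit sequence
                if self.seq == 0:                    # 4096 ids used up:
                    while now <= self.last_ms:       # spin to the next ms
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.worker_id << 12) | self.seq
```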

Leaf

Leaf provides two modes.

Segment mode

Segment mode optimizes the database scheme above. Instead of hitting the database for every ID, it asynchronously fetches a segment of N IDs from the database and caches it locally. It also uses a double buffer: when a certain fraction of the current segment has been handed out, another thread starts fetching and caching the next segment's data (a sketch follows the advantages and disadvantages below).

Advantages:

  1. IDs increase monotonically, and the local segment cache lets the service tolerate a database outage for a while.

Disadvantages:

  1. If the segment is too short, tolerance to DB downtime shrinks; if too long, the ID range jumps too far when the service restarts. The segment length can be adjusted dynamically according to consumption.
  2. Ultimately it still depends strongly on the DB.
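
A rough sketch of the segment buffer; `fetch_segment` stands in for Leaf's allocation update (roughly `UPDATE leaf_alloc SET max_id = max_id + step WHERE biz_tag = ...`), and the 10% threshold is illustrative:

```python
import threading

_db_max = [0]                         # simulates the allocation table row
def fetch_segment(step: int):
    # Stand-in for the DB allocation update; returns the id range
    # (start, end] granted to this node.
    _db_max[0] += step
    return _db_max[0] - step, _db_max[0]

class SegmentBuffer:
    def __init__(self, step: int = 1000):
        self.step = step
        self.cur, self.end = fetch_segment(step)
        self.next_seg = None
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            # Double buffer: when only 10% of the segment remains, fetch
            # the next one (real Leaf does this in a background thread).
            if self.end - self.cur <= self.step // 10 and self.next_seg is None:
                self.next_seg = fetch_segment(self.step)
            if self.cur >= self.end:          # current segment exhausted
                if self.next_seg is None:
                    self.next_seg = fetch_segment(self.step)
                self.cur, self.end = self.next_seg
                self.next_seg = None
            self.cur += 1
            return self.cur
```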

Snowflake mode

Meituan Leaf's Snowflake mode works like ordinary Snowflake, with two improvements.

  1. The 10-bit workerID is taken from the sequence number of the sequential node the service registers in zookeeper.
  2. Clock-rollback detection: the application periodically writes its own time to zookeeper and compares local time with the time stored there; if the difference exceeds a set threshold, a local clock rollback is assumed.

Service rate-limiting algorithms:

Counter

When the first request arrives, start a timer and begin counting; within the next 1 s, each request increments the count by 1. If the accumulated count reaches 100, reject all subsequent requests; when the second ends, reset the count to 0 and start over. This is easy to implement with redis' atomic, thread-safe INCR.
If 100 requests arrive within the first 10 ms of a 1 s window, then for the remaining 990 ms every request is rejected: the spike (burst) phenomenon.
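
A sketch of the fixed-window counter on Redis (the key format and the 100-per-second limit are illustrative):

```python
import time
import redis

r = redis.Redis()

def allow(user_id: str, limit: int = 100) -> bool:
    # One key per 1-second window; INCR is atomic across all clients.
    window = int(time.time())
    key = f"rate:{user_id}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 2)        # let the old window's key expire
    return count <= limit
```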

Sliding window algorithm

The sliding window algorithm divides the time window into N small periods, records the number of visits in each small period, and deletes expired periods as time slides forward. The more finely the window is divided, the smoother the window slides and the more accurate the rate limiting.
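
A minimal in-memory sketch that splits a 1 s window into sub-periods; a clustered version could keep the buckets in a Redis hash or sorted set (all parameters are illustrative):

```python
import time
from collections import deque

class SlidingWindow:
    def __init__(self, limit: int = 100, window: float = 1.0, buckets: int = 10):
        self.limit, self.width = limit, window / buckets
        self.window = window
        self.counts = deque()            # (bucket_start_time, count)

    def allow(self) -> bool:
        now = time.time()
        # Slide: drop sub-periods that fell out of the window.
        while self.counts and self.counts[0][0] <= now - self.window:
            self.counts.popleft()
        if sum(c for _, c in self.counts) >= self.limit:
            return False
        bucket = now - (now % self.width)   # start of current sub-period
        if self.counts and self.counts[-1][0] == bucket:
            self.counts[-1] = (bucket, self.counts[-1][1] + 1)
        else:
            self.counts.append((bucket, 1))
        return True
```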

Leaky bucket algorithm

The algorithm keeps a container: no matter how large the inflow from above, the outflow rate below stays constant. In practice, put requests into a queue and have a thread pool take requests from the queue and execute them at a fixed rate, possibly several at a time.
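
A minimal sketch: requests pour into a bounded queue and a drain thread releases them at a constant rate (the queue size and rate are illustrative):

```python
import queue, threading, time

bucket = queue.Queue(maxsize=100)      # the bucket: overflow = reject

def submit(request) -> bool:
    try:
        bucket.put_nowait(request)     # water pours in at any rate
        return True
    except queue.Full:
        return False                   # bucket overflows: reject

def handle(request):
    print("processing", request)

def drain(rate_per_sec: int = 10):
    while True:                        # water leaks at a fixed rate
        handle(bucket.get())
        time.sleep(1 / rate_per_sec)

threading.Thread(target=drain, daemon=True).start()
```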

Token bucket algorithm:

The algorithm puts tokens into a bucket at a fixed rate. Every request must first obtain a token; only with a token may it proceed, otherwise it waits for an available token or is rejected outright. In practice, keep a queue of tokens refilled periodically by a thread pool; each incoming request takes one token from the queue and continues. Guava's RateLimiter provides a simple token-bucket rate limiter; a plain sketch follows below.
Cluster-wide rate limiting: send an incr command to the redis server for each relevant operation. For example, to limit how often a user accesses the /index interface, concatenate the user id and the interface name into a redis key; each access executes incr on that key, and an expiration time on the key enforces the specified access frequency.
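
A sketch of the token bucket using lazy refill, i.e., computing the accrued tokens on each call instead of running a refill thread (the rate and capacity are illustrative):

```python
import time

class TokenBucket:
    def __init__(self, rate: float = 10, capacity: float = 20):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.time()

    def allow(self) -> bool:
        now = time.time()
        # Lazily add the tokens produced since the last call, capped at
        # the bucket's capacity (the cap is what permits short bursts).
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1           # spend one token for this request
            return True
        return False
```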

Consistent Hash algorithm:

A hash algorithm makes a fixed portion of requests land on the same server, so each server handles a share of the requests and the load is balanced. But an ordinary hash scales poorly: when a server is added or taken offline, most of the mapping between user ids and servers is invalidated. Consistent hashing improves on this with a hash ring.
Consistent hashing: treat the whole hash-value space as a clockwise ring starting at 0 and place each server on the ring by its hash. A request's hash lands somewhere on the ring, and the first server found clockwise from that point processes the request.
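
A sketch of the ring with virtual nodes, a common refinement that smooths the load when servers are few (the MD5 hash and replica count are conventional choices, not from the article):

```python
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes: int = 100):
        # Place vnodes virtual nodes per server on the ring so the load
        # stays even; the ring is the sorted list of (hash, server).
        self.ring = sorted((h(f"{s}#{i}"), s)
                           for s in servers for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def get(self, request_key: str) -> str:
        # First server clockwise from the request's position (wrap at 0).
        i = bisect.bisect(self.keys, h(request_key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.get("user:42"))   # adding/removing a server remaps few keys
```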

[Figure: consistent hash ring]