Zero data loss in a RabbitMQ cross-machine-room migration

Posted Jun 26, 2020 · 4 min read

I. Background

Most of the company's services have been running on a private cloud. The machines have been in use for quite a while, are aging, and are expensive to operate and maintain, so the plan is to move the entire machine room to the cloud. Since I am responsible for middleware, RabbitMQ has recently been migrated to the cloud successfully.

Let's first look at the rough deployment structure. For high availability, three Brokers are deployed on the private cloud; the application configures the three IPs directly in its configuration file, and the client randomly picks one of them for each request.

The goals of this migration are:

  1. Zero data loss, although messages may be consumed more than once;

  2. The entire MQ cluster must not be unavailable for an extended period (more than 2 minutes).

II. Solution analysis

To reason about data loss, we first need to know what data lives in RabbitMQ:

Exchange, Queue, Message.

For Exchanges and Queues, you can declare them on the fly when they do not exist, but that is hard to control, so we generally have an administrator create the Exchanges and Queues in the management console and set the appropriate properties. They should normally be persistent, i.e. durable = true, which guarantees that the Exchange and Queue still exist after a broker restart.
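For reference, a minimal sketch of declaring a durable Exchange and Queue with amqp-client; the exchange/queue names, routing key, and broker address below are made up for illustration:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class DeclareDurable {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("broker-a.example.com");   // hypothetical broker address
        try (Connection conn = factory.newConnection()) {
            Channel channel = conn.createChannel();
            // durable = true: the definitions survive a broker restart
            channel.exchangeDeclare("order.exchange", "direct", true);
            channel.queueDeclare("order.queue",
                    true,    // durable
                    false,   // not exclusive to this connection
                    false,   // not auto-deleted when the last consumer goes away
                    null);   // no extra arguments
            channel.queueBind("order.queue", "order.exchange", "order.created");
        }
    }
}
```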

Message persistence has to be requested when the message is sent. Using the amqp-client library, the publish call looks like this:

```java
channel.basicPublish(exchange, routingKey, basicProperties, payload);
```

Here, basicProperties carries the message properties; its type is AMQP.BasicProperties:

```java
public static class BasicProperties extends com.rabbitmq.client.impl.AMQBasicProperties {
        private String contentType;
        private String contentEncoding;
        private Map<String, Object> headers;
        private Integer deliveryMode;
        private Integer priority;
        private String correlationId;
        private String replyTo;
        private String expiration;
        private String messageId;
        private Date timestamp;
        private String type;
        private String userId;
        private String appId;
        private String clusterId;
        // ... constructors, getters and Builder omitted
}
```
Note the deliveryMode attribute: setting it to 2 makes the message persistent.
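A minimal publishing sketch, assuming the hypothetical durable exchange declared above and an already-open channel; MessageProperties.PERSISTENT_TEXT_PLAIN is a ready-made constant with deliveryMode already set to 2:

```java
import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.AMQP;

// Build properties with deliveryMode = 2 so the broker writes the message to disk.
AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
        .deliveryMode(2)                                      // 2 = persistent, 1 = transient
        .messageId(java.util.UUID.randomUUID().toString())    // handy for idempotency later
        .contentType("application/json")
        .build();
channel.basicPublish("order.exchange", "order.created",
        props, "{\"orderId\":1}".getBytes(StandardCharsets.UTF_8));
```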

Let's analyze which scenarios may cause data loss:

  1. A single machine goes down and cannot be restarted

Suppose the cluster has three brokers: A, B, and C. Messages are sent as follows:

If the current request lands on A, the message is stored on node A, and persistence writes it only to that machine's disk.

If the message has not yet been consumed and A goes down at that point, the message is lost.

For this scenario, the official recommendation is the mirrored queue solution. With mirroring, message sending works like this:

The message is first sent to machine A, the master of the queue, and A then synchronizes it to all the other machines.

Now even if A goes down, the master of the queue fails over to another machine. Because the message was already synchronized to all the other machines when it was sent, it is not lost; it may, however, be consumed more than once, so the business side has to handle messages idempotently.
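A rough consumer-side idempotency sketch, assuming the producer sets a unique messageId on every message; in a real system the processed ids would live in a shared store (Redis, a DB unique key) rather than an in-memory set, and handleBusinessLogic is a hypothetical handler:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import com.rabbitmq.client.DeliverCallback;

Set<String> processed = ConcurrentHashMap.newKeySet();
DeliverCallback onDeliver = (consumerTag, delivery) -> {
    String messageId = delivery.getProperties().getMessageId();
    // Only process a given messageId once; duplicates are acked but skipped.
    if (messageId == null || processed.add(messageId)) {
        handleBusinessLogic(delivery.getBody());
    }
    channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
};
channel.basicConsume("order.queue", false /* manual ack */, onDeliver, consumerTag -> {});
```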

The mirrored-queue policy we added (via the management UI) is as follows:

```
Virtual host: "/oneplus"
Name:         "all_ha"
Pattern:      "^"
Apply to:     "Queues"
Priority:     0
Definition:   ha-mode: all
              ha-sync-mode: automatic
```
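The same policy can also be created from the command line; assuming the values above, the rabbitmqctl equivalent would look roughly like this:

```
rabbitmqctl set_policy -p /oneplus --priority 0 --apply-to queues \
    all_ha "^" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
```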
  2. Split brain

We have run into this in production; it can be avoided by adding the following configuration:

{cluster_partition_handling, pause_minority}
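For reference, in the classic Erlang-term configuration file this entry sits under the rabbit application (the file path shown is typical, not universal):

```
%% /etc/rabbitmq/rabbitmq.config
[
  {rabbit, [
    {cluster_partition_handling, pause_minority}
  ]}
].
```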

Also note a few points during the migration process:

1). Try to keep the total number of cluster nodes odd;

2). Minimize the time during which the cluster spans both machine rooms;

Having planned all of the above, can we sit back and call the migration foolproof? The above only covers the technical side; we also have to safeguard the stability of the process in other ways:

  1. Test

Build a full test environment and use it to verify the effectiveness of the entire plan.

  2. Prepare a failure plan

A. Back up the metadata

This can be done from the RabbitMQ management UI.
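Besides the UI's export button, the definitions (exchanges, queues, bindings, users, policies) can also be pulled through the management HTTP API; a sketch with placeholder credentials and host:

```
curl -u <admin-user>:<password> http://<broker-host>:15672/api/definitions > definitions-backup.json
```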

B. The core process must have a data verification plan

As an e-commerce company, we must guarantee that the MQ messages on the main transaction flow are never lost, so we need a data verification and compensation scheme for them.
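The concrete scheme is business-specific; purely as an illustration of the shape it can take, the send side records every core message id before publishing, and a periodic job re-publishes anything not marked consumed within a timeout (all types below are hypothetical):

```java
import java.util.List;

public class MessageReconciler {
    /** Hypothetical store of sent message ids and their consumed/unconsumed state. */
    interface MessageRecordStore { List<String> findUnconsumedOlderThanMinutes(int minutes); }
    /** Hypothetical component that re-sends a recorded message by id. */
    interface Republisher { void republish(String messageId); }

    private final MessageRecordStore store;
    private final Republisher republisher;

    public MessageReconciler(MessageRecordStore store, Republisher republisher) {
        this.store = store;
        this.republisher = republisher;
    }

    /** Run periodically (e.g. from a scheduler) before, during, and after the migration. */
    public void reconcile() {
        for (String messageId : store.findUnconsumedOlderThanMinutes(5)) {
            republisher.republish(messageId);   // consumers stay idempotent, so re-sending is safe
        }
    }
}
```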

III. The migration process

Finally, let's walk through the migration process itself.

  1. Starting point: the current state of the cluster in the old machine room

  2. Add 4 Brokers in the new machine room

Execute the following command on each of the new machines:

rabbitmqctl stop_app
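rabbitmqctl stop_app by itself only stops the broker application; to actually join each new node to the existing cluster, the usual sequence is roughly the following (the node name is a placeholder for one of the existing nodes):

```
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@${existing_node}
rabbitmqctl start_app
rabbitmqctl cluster_status    # confirm the new node shows up in the cluster
```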

  3. Both machine rooms now have Brokers and the total is still odd; start taking the old machine room's nodes offline

First execute this command on the machine to be taken offline:

rabbitmqctl stop_app

Then execute this command on a surviving node:

rabbitmqctl forget_cluster_node rabbit@${server1}

  4. Finally, take the remaining machines in the old machine room offline

The commands are the same as above.

Finally, to summarize, we safeguarded the stability of the migration in the following ways:

  1. Configure parameters so that, as far as possible, no data in MQ is lost

The main work here: configuring mirrored queues and preventing split brain;

  2. Verify everything fully in the test environment;

  3. Ensure that the main-flow messages have a verification and compensation plan.
