1. Background

In this article, let's talk about some of the principles behind the high-availability architecture of message middleware.

Any competent senior Java engineer will sooner or later work on a system that uses MQ (a message queue). When that happens, you need to think through the technical problems you may run into with MQ, based on your business scenarios and requirements.

Then you have to design a complete technical solution to those problems.

You need to look at it from every angle: the message subscription model, how to avoid losing data anywhere along the path from production to consumption, how the message middleware itself stays highly available, and so on. Only then do you have a complete technical solution for integrating your system with MQ.

So this article focuses on the architectural principles behind the high availability of message middleware.

2. First, let's think about the availability of message middleware

Let's put specific technologies aside and think about the availability of MQ.

Look at the following figure; the reasoning is actually very simple. Suppose your MQ is deployed on a single machine. Under normal circumstances, the producer sends a message to MQ and the consumer then fetches it.

But the unexpected does happen. Suppose the MQ process crashes for some inexplicable reason, or the machine it runs on goes down outright. What happens then?

Awkward, isn't it? The result is clear: producers can't send data, and consumers can't get any.

And then isn't the whole system finished? The core flow of the system simply cannot run end to end.

MQ downtime directly causes your own system to fail, which in turn may take down your company's apps, websites and other public-facing products, so users cannot use your company's services.

If your company is an e-commerce platform, a food-delivery platform or a social platform, wouldn't an outage like this cause heavy losses?

Suppose your company's e-commerce platform normally takes in 100 million a day, and the system is unusable for several hours so nobody can place an order during that time. If the day's revenue ends up at 50 million, hasn't your company just lost 50 million outright?

This is really no joke. If you follow Internet industry news and gossip, you will know that in recent years some large Internet companies have had incidents like this, with heavy losses. And if we were the programmers responsible, we would be the ones held to account, right?

3. Clustered deployment + data multi-copy redundancy

So here's the question: how should MQ middleware achieve high availability?

There are many ways to do this, such as multi-replica data redundancy and cluster mirror synchronization. Let's put specific technologies aside and think about how an MQ cluster can achieve high availability.

Look at the following diagram. Suppose the data we write to MQ is stored redundantly as multiple copies, that is, every message you write is also copied to another machine.

Then if any one machine goes down, it should not affect our communication with MQ, and the data already written should still be there.

In the figure above, MQ is deployed as a cluster on two machines. The producer writes a message to one of the machines, and that machine automatically replicates it to the other.

At this time, the data is on two machines, and there are two copies. So if the first machine goes down, will it affect us?

The answer is: No.

Because the data is stored redundantly as multiple copies, the consumer can consume messages from the second machine, the producer can keep writing messages to the second machine, and no data is lost.

Moreover, the system does not need to interrupt its flow at all and can keep running, as the following figure shows.

Feels great, doesn't it? In fact, this combination of clustered deployment and multi-replica data redundancy is a very common high-availability architecture.
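The failover behaviour described above can be sketched with a small in-memory model. This is only an illustration under simplifying assumptions: `Broker`, `ReplicatedCluster` and `FailoverDemo` are hypothetical names, replication is modelled as writing to every live broker, and no real MQ is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory broker: a named message log plus an up/down flag.
class Broker {
    final String name;
    final List<String> log = new ArrayList<>();
    boolean alive = true;

    Broker(String name) { this.name = name; }
}

// Hypothetical cluster that copies every message to all live brokers.
class ReplicatedCluster {
    private final List<Broker> brokers;

    ReplicatedCluster(List<Broker> brokers) { this.brokers = brokers; }

    // Store the message on every live broker; fail if none is reachable.
    void send(String message) {
        int copies = 0;
        for (Broker b : brokers) {
            if (b.alive) {
                b.log.add(message);
                copies++;
            }
        }
        if (copies == 0) {
            throw new IllegalStateException("no broker available");
        }
    }

    // Read from the first live broker, which is all that failover means here.
    List<String> consumeAll() {
        for (Broker b : brokers) {
            if (b.alive) {
                return new ArrayList<>(b.log);
            }
        }
        throw new IllegalStateException("no broker available");
    }
}

class FailoverDemo {
    public static void main(String[] args) {
        Broker first = new Broker("broker-1");
        Broker second = new Broker("broker-2");
        ReplicatedCluster cluster = new ReplicatedCluster(List.of(first, second));

        cluster.send("order-1");   // stored on both machines
        first.alive = false;       // the first machine goes down
        cluster.send("order-2");   // the producer keeps writing via broker-2

        System.out.println(cluster.consumeAll()); // prints [order-1, order-2]
    }
}
```

The consumer still sees both messages because `order-1` had already been copied to the second machine before the first one failed.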

Kafka, an excellent message middleware, uses this architecture to provide high availability and fault tolerance.

4. Synchronous multi-replica replication is mandatory

But here you have to think about a few other questions.

The first question: when you write data to one of the machines, do you require that machine to copy the data to another machine, so that the cluster is guaranteed to hold two copies, before you consider the write successful?

Yes, you do. If you cannot guarantee this, then the following can happen: you write data to one of the machines, and before it has time to copy the data to the other machine, the first machine goes down.

At that point, although you can keep sending and consuming messages through the second machine, the message you just sent is lost.

Let’s look at the following diagram to understand this scenario.

So for this mechanism, the producer must be configured, through certain parameters, so that a message written to one machine must also be successfully synchronized to another machine. Only once the cluster holds two replicas can the message be considered written successfully.

If the machine goes down right after receiving the write, before it has had time to copy the message to another machine, the write should fail with an error. You should then retry writing the data to the MQ cluster.

Let's take a look at the picture below. Once a write succeeds, the data is guaranteed to exist as two copies. At that point, even if one machine goes down, no data is lost, and production and consumption continue in an orderly manner.
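To make the difference concrete, here is a minimal sketch comparing two acknowledgement policies when the first machine crashes before copying the message. Everything in it is hypothetical (`SyncAckDemo`, `AckMode`, the crash flag), and it assumes the retry reaches a healthy pair of machines; it is a toy model rather than how any particular MQ implements replication.

```java
import java.util.ArrayList;
import java.util.List;

class SyncAckDemo {
    // LEADER_ONLY: acknowledge as soon as the first machine has the message.
    // ALL_REPLICAS: acknowledge only after the second machine has it too.
    enum AckMode { LEADER_ONLY, ALL_REPLICAS }

    // One write attempt. survivingLog stands for the machine that stays up.
    // Returns true if the producer is told the write succeeded.
    static boolean attempt(String msg, List<String> survivingLog,
                           AckMode mode, boolean crashBeforeCopy) {
        if (crashBeforeCopy) {
            // The first machine stored the message and died before copying it.
            // LEADER_ONLY has already reported success; ALL_REPLICAS reports an error.
            return mode == AckMode.LEADER_ONLY;
        }
        survivingLog.add(msg); // replication completed
        return true;
    }

    // Producer-side retry loop. Only the first attempt hits the crash window.
    static boolean send(String msg, List<String> survivingLog,
                        AckMode mode, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            boolean crashBeforeCopy = (i == 0);
            if (attempt(msg, survivingLog, mode, crashBeforeCopy)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> logA = new ArrayList<>();
        boolean ackA = send("order-1", logA, AckMode.LEADER_ONLY, 3);
        // Acknowledged, but the surviving machine never received the message.
        System.out.println(ackA + " " + logA.contains("order-1")); // prints true false

        List<String> logB = new ArrayList<>();
        boolean ackB = send("order-1", logB, AckMode.ALL_REPLICAS, 3);
        // The first attempt failed, the retry succeeded, and the message is stored.
        System.out.println(ackB + " " + logB.contains("order-1")); // prints true true
    }
}
```

In the first case the producer believes the write succeeded even though the message is gone. In the second, the failure is visible to the producer, so the retry can repair it.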

5. Multiple machines hosting multiple copies is mandatory 

The second question: suppose your cluster has two machines and one of them goes down, leaving only one. Should your producer still be allowed to keep writing data to that one remaining machine?

The answer is: no.

Because if only one machine in the cluster is accepting writes, what happens if that last machine goes down too? You would again have data loss and a failed cluster.

Therefore, your producer should be configured, via parameters, to require that at least two machines in the cluster are able to receive a copy of your data.

If only one machine can accept a copy of your data, the write should not go ahead.

Let’s take a look at the picture below and feel the scene.

Suppose the cluster has 3 machines and one of them goes down. When you later write to one of the others, the check finds that two machines remain in the cluster, which is enough to guarantee two copies of the data and therefore high availability and fault tolerance, so you can keep writing to the MQ cluster normally.
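The rule in this section boils down to a single comparison, sketched below. `MinReplicaGuard` and `canAcceptWrite` are hypothetical names used only for illustration, and the required copy count of 2 is the value assumed throughout this article.

```java
class MinReplicaGuard {
    // Accept a write only while enough machines are alive to hold the required copies.
    static boolean canAcceptWrite(int aliveMachines, int minCopies) {
        return aliveMachines >= minCopies;
    }

    public static void main(String[] args) {
        int minCopies = 2; // two copies, as in the article

        System.out.println(canAcceptWrite(3, minCopies)); // prints true: healthy 3-machine cluster
        System.out.println(canAcceptWrite(2, minCopies)); // prints true: one machine down, two left
        System.out.println(canAcceptWrite(1, minCopies)); // prints false: only one machine left
    }
}
```

Rejecting the write in the last case trades availability for safety: the producer sees an error instead of storing a message that has only a single copy.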

In fact, the whole set of mechanisms described above is available in Kafka. It has parameters for configuring how many copies of the data to keep and how many machines each write must be replicated to before it counts as successful (otherwise the message is resent). You can also set how many machines in the cluster must be able to host copies before writes are allowed to continue.
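As a rough illustration, the Kafka settings that correspond to these mechanisms can be collected as plain key/value pairs. The keys `acks`, `retries` and `min.insync.replicas` are real Kafka configuration names, and the replication factor is chosen when a topic is created. The class name `KafkaHaSettings` and the specific values are assumptions made for this sketch, which deliberately avoids depending on the Kafka client library.

```java
import java.util.Properties;

// Hypothetical holder for the settings discussed above; the values are illustrative.
class KafkaHaSettings {
    // Number of copies of each partition, chosen when the topic is created.
    static final short REPLICATION_FACTOR = 3;

    // Producer-side settings.
    static Properties producerSettings() {
        Properties p = new Properties();
        // Wait for all in-sync replicas to acknowledge before a write counts as successful.
        p.setProperty("acks", "all");
        // Resend on failure; the value 3 is an arbitrary example.
        p.setProperty("retries", "3");
        return p;
    }

    // Topic-level setting.
    static Properties topicSettings() {
        Properties t = new Properties();
        // With acks=all, writes are rejected when fewer than 2 replicas are in sync.
        t.setProperty("min.insync.replicas", "2");
        return t;
    }

    public static void main(String[] args) {
        System.out.println(producerSettings().getProperty("acks"));             // prints all
        System.out.println(topicSettings().getProperty("min.insync.replicas")); // prints 2
    }
}
```

In a real application the producer properties would be passed to the Kafka producer together with connection and serializer settings, which are omitted here.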

By designing this complete solution and implementing it on a specific technology, you can guarantee that, in a clustered deployment, several machines are always available to host the copies and that data is stored redundantly once written.

At that point, if any one machine goes down, no data is lost and the system keeps operating normally.

6. The architectural principles are technology-independent

In fact, the discussion of clustered high-availability architecture in this article is entirely independent of any specific technology; it approaches the topic at the level of underlying principles.

Specific message middleware such as RabbitMQ, Kafka and RocketMQ implement this high-availability architecture in broadly similar ways, but each also has its own technical implementation and corresponding differences.
