Today we will talk about the high availability of the Internet three highs (high concurrency, high performance, high availability), after reading this article, I believe it will solve most of your confusion about high availability design.

I. Introduction

The main purpose of high availability (HA) is to ensure “business continuity”, that is, in the eyes of users, the business will always be normal (or basically normal) to provide services to the outside world. High availability is mainly for the architecture, then to do a good job of high availability, we must first design a good architecture, the first step we will generally use the idea of layering to split a huge IT system into an application layer, middleware, data storage layer and other independent layers, each layer and then split into more fine-grained components, the second step is to let each component provide services to the outside world, after all, each component is not isolated, need to cooperate with each other, external services are meaningful.

To ensure the high availability of the architecture, it is necessary to ensure that all components in the architecture and its exposed services must be designed for high availability, and any component or its services are not highly available, which means that the system is at risk.

So how to do so many components how to do high-availability design, in fact, any component to do high-availability, are inseparable from “redundancy” and “automatic failover”, as we all know that a single point is the enemy of high availability, so the components are generally in the form of clusters (at least two machines) exist, so that as long as a machine has a problem, other machines in the cluster can be replaced at any time, which is “redundancy”. To do a simple calculation, assuming that the availability of one machine is 90%, the cluster availability of two machines is 1-0.1*0.1 = 99%, so obviously the more redundant machines, the higher the availability.

But redundancy is not enough, if the machine has a problem, the need for manual switching is also time-consuming and laborious, and easy to make mistakes, so we also need to use the power of third-party tools (that is, arbitrators) to achieve “automatic” failover, in order to achieve the purpose of near-real-time failover, near-real-time failover is the main meaning of high availability.

What kind of system can be called high availability, the industry generally uses several nine to measure the availability of the system, as follows:

Generally achieve two 9 is very simple, after all, the daily downtime of 14 minutes has seriously affected the business, such a company sooner or later to stop cooking, the factory generally requires 4 9, other demanding business to reach more than five nine, such as if because of a computer failure caused all trains to stop, then there will be tens of thousands of people normal life hindered, this situation requires more than five nine.

Let’s take a look at how the various components of the architecture achieve high availability with the help of “redundancy” and “automatic failover”.

Second, the analysis of the Internet architecture

At present, most of the Internet will use microservices architecture, and the common architectures are as follows:

You can see that the architecture is mainly divided into the following layers

Access layer: F5 hardware or LVS software primarily carries all traffic ingress

Reverse proxy layer: Nginx, which is mainly responsible for distributing traffic based on URLs, throttling flow, etc

Gateway: mainly responsible for flow control, risk control, protocol conversion, etc

Site layer: It is mainly responsible for invoking basic services such as members and promotions to assemble data such as json and return it to the client

Basic service: In fact, the site layer belongs to the microservice, which is a peer relationship, but the basic service belongs to the infrastructure and can be called by the servers of the upper layer of each business layer

Storage layer: That is, DB, such as MySQL, Oracle, etc., is generally returned to the site tier by the underlying service call

Middleware: ZK, ES, Redis, MQ, etc., mainly play a role in accelerating access to data and other functions, in the following we will briefly introduce the role of each component

As mentioned earlier, to achieve high availability of the overall architecture, it is necessary to achieve high availability of components at each layer, so let’s take a look at how each layer of components is highly available.

Third, the access layer & reverse proxy layer

The high availability of both tiers is related to keepalived, so let’s look at it together

Externally, the two LVS in the form of the main and standby external services, note that only the master is working (that is, the VIP at this time is effective on the master), and the other backup will take over the master’s work after the master is down, then how does the backup know whether the master is normal, the answer is through keepalived, on the main and standby machines are installed with keepalived software, After starting, it will detect each other’s health through the heartbeat, once the master is down, keepalived will detect, so that the backup automatically turns into a master to provide external services, then the VIP address (that is, in the figure) is effective on the backup, which is what we often call “IP drift”, which solves the high availability of LVS in this way.

Keepalived’s heartbeat detection is mainly detected by sending ICMP packets, or by using TCP port connection and scan detection, and it can also be used to detect Nginx-exposed ports so that if some Nginx is not healthy, Keepalived can detect and exclude them from the list of services that LVS can forward. Nginx can also detect the health status of the service through the port.

Borrowing keepalived as a third-party tool, while achieving high availability of LVS and Nginx, and at the same time in the event of a failure, you can also send the downtime to the corresponding developer’s mailbox so that they can receive timely notification processing, it is indeed very convenient, Keepalived is widely used, and we will see below that it can also be used on MySQL to achieve the high availability of MySQL.

4. Microservices

Next, let’s look at the “gateway”, “site layer”, “basic service layer”, these three are generally what we call microservice architecture components, of course, these microservice components also need to be supported by some RPC frameworks such as Dubbo to communicate, so microservices to achieve high availability, that means that dubbo These RPC frameworks also provide the ability to support the high availability of microservices, we will take dubbo as an example to see how it achieves high availability.

Let’s first take a brief look at the basic architecture of dubbo:

The idea is also very simple, first of all, the provider (service provider) registers the service with the Registry (such as ZK or Nacos, etc.), and then the Consumer (service consumer) subscribes to the registry and pulls the list of Provider services, and after obtaining the list of services, Consumer can select one of the providers according to its load balancing policy to make a request to it, when one of the providers is available When it is not available (offline or because of GC blocking, etc.), it will be monitored by the registry in time (through the heartbeat mechanism) and will be pushed to the consumer in time, so that the Consumer can remove it from the list of available providers, which also realizes the automatic transfer of the failure, and it is not difficult to see that the registry center plays a similar keepalived role.

5. Middleware

Let’s take a look at how these middleware such as ZK, Redis, etc. achieve high availability.

In the previous section of microservices we mentioned the registry, then we take ZK (ZooKeeper) as an example to see how its high availability is implemented, first look at its overall architecture diagram as follows:

The main characters in Zookeeper are as follows

1) Leader: That is, the leader, there is only one Leader in the cluster, which mainly undertakes the following functions

The sole dispatcher and handler of transaction requests ensures the sequentiality of cluster transaction processing, and all follower write requests are forwarded to the Leader for execution to ensure transaction consistency

The dispatcher of each server in the cluster: After processing the transaction request, the data broadcast will be synchronized to each Follower, count the number of successful follower writes, more than half of the Follower write success, the Leader will consider the write request to be submitted successfully, notify all Follower commit this write operation, to ensure that even if the cluster crash recovery or restart afterwards, the write operation will not be lost.


Handles client non-transactional requests and forwards transactional requests to the leader server

Participate in the voting of the Proposal request (more than half of the servers need to pass to notify the leader commit data; Leader-initiated proposal, asking Follower to vote)

Vote for the Leader election

Voice-over: A new Observer role was added after Zookeeper 3.0, but it is not very relevant to the ZK high availability discussed here, so it is omitted to simplify the problem.

You can see that because there is only one Leader, obviously, this Leader has a single point of hidden danger, then how ZK solves this problem, first of all, Follower and Leader will keep connected with the heartbeat mechanism, if the Leader has a problem (downtime or because of FullGC and other reasons can not respond), Follower can not perceive the Leader’s heartbeat, will think that Leader is in trouble, so they will initiate a vote, In the end, a leader was chosen out of several followers (Zookeeper Atomic Broadcast, the ZAB protocol, which is a consistency protocol designed specifically for ZK to support crash recovery), and the details of the election are not the focus of this article, so I will not elaborate here.

In addition to the ZAB protocol, there are also protocol algorithms such as Paxos and Raft commonly used in the industry, which can also be used in Leader election, that is, in a distributed architecture, these protocol algorithms assume the role of “third party” that is, the arbiter, to bear the automatic transfer of failure.

The high availability of Redis needs to be seen according to its deployment mode, which is mainly divided into two types: “master-slave mode” and “Cluster shard mode”.

1) Master-slave mode

First, let’s look at the master-slave mode, the architecture is as follows:

Master-slave mode is a master and multiple slaves (one or more slave nodes), where the master node is mainly responsible for reading and writing, and then the data will be synchronized to a plurality of slave nodes, the client can also initiate read requests to multiple slave nodes, which can reduce the pressure on the master node, but as with ZK, because there is only one master node, there is a single point of hidden danger, so it is necessary to introduce a third-party arbitrator mechanism to determine whether the master node is down and quickly select a slave node to act as the role of the master node after determining that the master node is down. This third-party arbitrator is generally referred to as “sentinel” in Redis, although the sentinel process itself may also hang up, so for security reasons, multiple sentinels (i.e. sentry clusters) need to be deployed.

These sentries use the gossip protocol to receive information about whether the master server is offline or not, and use the Raft protocol to elect a new masternode after determining that the master node is down.

2) Cluster sharded cluster

The master-slave mode seems perfect, but there are several problems:

The pressure of the master node to write is difficult to reduce: because only one master node can receive write requests, if the write request is high, it may fill the network card of the master node if it is high, causing the master node to be unable to serve the outside world

The storage capacity of the master node is limited by the storage capacity of the single machine: because whether it is a master node or a slave node, the storage is a full amount of cache data, so as the volume of traffic grows, the cache data is likely to rise sharply until the storage bottleneck is reached

Synchronization storm: Because the data is synchronized from master to slave, if there are multiple slave nodes, the pressure on the master node will be great

In order to solve the above problems of master-slave mode, sharding clusters came into being, the so-called sharding cluster is to shard the data, each shard data is read and written by the corresponding master node, so that there are multiple master nodes to share the pressure of writing, and each node only stores part of the data, which also solves the problem of stand-alone storage bottleneck, but it should be noted that each master node has a single point of problem, so it is necessary to do high availability for each master node, and the overall architecture is as follows:

The principle is also very simple, after the proxy receives the redis read and write command executed by the client, the key will first be calculated to get a value, if the value falls within the range of values that the corresponding master is responsible for (generally each number is called a slot, Redis has a total of 16384 slots), then send this redis command to the corresponding master to execute, you can see that each master node is only responsible for processing part of the redis data, At the same time, in order to avoid the single point problem of each master, it is also equipped with multiple slave nodes to form a cluster, and when the master node goes down, the cluster will elect a master node from the slave node through the Raft algorithm.

Let’s take a look at how ES achieves high availability, where data exists in the form of shards, as shown in the following figure, where index data is divided into three sharded stores in one node:

But if there is only one node, there is obviously the same single point problem as the master-slave architecture of Redis, this node is hung, ES is also hung, so obviously need to create multiple nodes.

Once multiple nodes are created, the advantages of sharding (P in the figure as the main shard, R as the replica shard) are reflected, and the shard data can be distributed to other nodes, which greatly improves the horizontal scalability of the data, and each node can undertake read and write requests, and the use of load balancing avoids the single point of read and write pressure.

ES’s write mechanism is somewhat different from the master-slave architecture of Redis and MySQL (the latter two write requests are directly to the master node, while ES is not), so here is a little explanation of how ES works:

First of all, the working mechanism of the next node, the node (Node) is divided into master node (Master Node) and slave node (Slave Node), the main responsibility of the master node is to be responsible for the relevant operations at the cluster level, manage cluster changes, such as creating or deleting indexes, tracking which nodes are part of the cluster, and deciding which shards are assigned to the relevant nodes, the master node is also only one, generally elected by the class Bully algorithm, if the master node is not available, Other slave nodes can also be elected through this algorithm to achieve high availability of the cluster, and any node can receive read and write requests to achieve load balancing purposes.

Let’s talk about the working principle of sharding, sharding is divided into primary shards (that is, P0, P1, P2 in the graph) and replica shards (Replica Shard, that is, R0, R1, R2 in the graph), the main shard is responsible for the write operation of the data, so although any node can receive read and write requests, if this node receives a write request and does not write the data in the main shard, the node will dispatch the write request to the node where the main shard is located, after writing to the main shard, The primary shard then copies the data to the replica shards of other nodes, taking a cluster with two replicas as an example, and the write operation is as follows:

ES uses data sharding to improve high availability and horizontal scalability of the idea is also applied to the architecture design of other components, we take Kafka in MQ as an example to look at the application of data sharding:

Kafka high-availability design, image from “Wu Ge Man Talk about IT”

As above, Kafka clusters, you can see that the partition of each topic is distributed on other message servers, so that once a partition is unavailable, you can elect a leader from the follower to continue the service, but unlike the data shards in ES, the follower partition is a cold standby, that is, under normal circumstances, it will not be served externally, only after the leader is hung up A leader is elected in a follower before it can provide services to the outside world.

Sixth, the storage layer

Next, let’s look at the last layer, the storage layer (DB), here we take MySQL as an example to briefly discuss its high-availability design, in fact, if you read the above high-availability design, you will find that MySQL’s high availability is just so, the ideas are similar, similar to Redis, it is also divided into master-slave and shard (that is, we often say database sub-table) two architectures.

Master-slave words are similar to LVS and generally use keepalived forms to achieve high availability, as follows:

If the master is down, Keepalived will also find out in time, so the slave library will upgrade the master library, and the VIP will also “drift” to the original slave library to take effect, so the MySQL address that everyone configures in the project is generally VIP to ensure high availability

After the amount of data is large, it is necessary to split the database and divide the table, so there is a multi-master, just like the Redis shard cluster, which needs to be equipped with multiple slaves for each master, as follows:

Before some readers asked why they should do the master and slave after the database sub-table, now I think everyone should understand that it is not to solve the problem of reading and writing performance, but mainly to achieve high availability.

Seven, summary

After reading the high-availability design at the architectural level, I believe that everyone will have a deeper understanding of the core idea of high availability “redundancy” and “automatic failover”, observing the components in the above architecture you will find that the main reason for redundancy is because there is only one master, why not have more masters, it is not impossible, but it is very difficult to ensure the consistency of data in distributed systems, especially if there are more nodes, synchronization between data is a big problem, so most components use the form of a master, Then synchronize between the master and the multi-slave, and the reason why most components choose a master is essentially a technical tradeoff.

So whether the entire architecture is really available after the high availability of each component, non-also, this can only be said to be the first step, there are many unexpected situations in production that will make our system face challenges, such as:

Instantaneous traffic problems: For example, we may face the instantaneous traffic surge caused by the second sale to cause the carrying capacity of the system to be crushed, which may affect the core links such as daily transactions, so it is necessary to achieve isolation between systems, such as deploying a separate cluster for the second deal

Security issues: such as DDOS attacks, frequent requests by crawlers and even deletion of libraries and runaways, etc., lead to system denial of service

Code issues: such as code bugs causing memory leaks that cause FullGC to make the system unresponsive

Deployment problems: If you hastily abort the currently running service during the release process, it is not possible, and you need to do elegant shutdown and smooth release

Third-party issues: For example, our previous services relied on third-party systems, and third-party problems may affect our core business

Force majeure: such as the power outage of the computer room, so it is necessary to do a good job of disaster recovery, remote work, before our business due to the failure of the computer room caused by the service of four hours is unavailable, heavy losses

Therefore, in addition to the high availability of the architecture, we also need to do a good job in system isolation, flow restriction, circuit breaking, risk control, downgrade, and restricting the operator’s permissions for key operations to ensure the availability of the system.

Here is a special mention of descent, which is a common measure taken to ensure system availability, to give a few examples:

We were unable to borrow money due to problems with the borrowing function of a third-party fund party due to their own reasons, in order to avoid causing panic among users, so we returned a copy like “In order to increase your quota, the funding party is upgrading the system” when the user applied for a third-party loan, avoiding the customer lawsuit.

In the field of streaming media, when users watch live broadcasts with serious stuttering, the first choice of many companies is not to check the log to troubleshoot the problem, but to automatically reduce the bitrate for users. Because compared to the reduced image quality, the card can not see the obviously will make the user more painful.

During the peak period of Double Eleven Zero, we stopped the non-core functions such as user registration and login to ensure the smooth operation of the core process such as placing orders.

In addition, we better be able to do advance defense, before the system problems to kill it in the cradle, so we need to do unit testing, do full-link pressure testing, etc. to find problems, but also need to monitor the CPU, the number of threads, etc., when it reaches the domain value we set, trigger an alarm to allow us to find and fix the problem in time (our company has encountered a similar production accident before the review of you can take a look), in addition, under the premise of unit testing, it is still possible to cause online problems because of potential bugs in the code , so we need to close the net at critical times (such as during Singles Day) (that is, not to let the code be released)

In addition, we also need to be able to quickly locate the problem after the accident and quickly roll back, which requires recording the release time of each time, publisher, etc., and the release here includes not only the release of the project, but also the release of the configuration center.

Voice-over: The picture above is our company’s release record, you can see that there are code changes, rollbacks, etc., so that if you find a problem, you can roll back with one click.

Finally, let’s summarize the common means of high availability with a graph: