I. Brief Description

This article does not cover the concept of geo-distributed multi-active architecture or why long-distance multi-active is necessary. There are many deployment patterns, such as same-city active-active, two sites with three centers, three sites with five centers, and so on.

Before reading on, let's clarify some background so that the rest of the article is easier to follow.

At present, the first phase of our multi-active transformation involves two data centers, data center A and data center B. Most of the figures in this article carry these labels to distinguish the two data centers.

We define data center A as the central data center, that is, the data center already in use before the multi-active launch. Whenever the article mentions the central data center, it means data center A. Data center B may be described as the unit data center; that refers to data center B.

To put unitization simply, think of a unit as a data center that can complete the business closed loop on its own. For example, a user opens the app, browses products, selects one, confirms the order, places the order, pays, and views the order information; the entire flow completes within one unit, and the data is stored within that unit as well.

There are really only two reasons to do unitization: disaster recovery and raising the system's concurrency. You also have to weigh the scale of data center construction and the cost of technology, hardware, and other investments. I won't go into the details here; the trade-offs are easy to imagine.

II. Transformation Points

Before looking at what needs to change, let's look at the current single data center architecture, to better understand why these transformations are necessary.

As shown in the preceding figure, a client request first reaches SLB (load balancing), then our internal gateway, which dispatches it to the specific business service. Business services depend on middleware such as Redis, MySQL, MQ, and Nacos.

Since we are going multi-active across regions, we need data centers in different areas: the central data center and the unit data center. The effect we want to achieve is shown in the following figure:

Looking at the picture above, you may think this is simple: it is just the usual middleware, deployed in one more data center, so what's the difficulty? If you think so, I can only say you are underestimating the problem.

The first thing we had to transform is request routing: when a user's request leaves the client, which data center should it land in?

Before multi-active, the domain name resolved to a single data center; afterward, it resolves randomly to different data centers. Purely random routing is clearly problematic. It does not matter for service calls, which are stateless, but the storage that services depend on is stateful.

We run an e-commerce business. Suppose a user places an order in the central data center and then jumps to the order details page, and that request lands in the unit data center. The underlying data synchronization has a delay, so the access fails with "the order does not exist". The user is baffled: the money is paid, but the order is gone.

Therefore, for the same user, the business loop should close within one data center as far as possible. To solve traffic scheduling, we developed the DLB traffic gateway based on OpenResty. DLB connects to the multi-active control center and knows which data center the current user belongs to; if the user does not belong to the current data center, DLB routes the request directly to the DLB in the user's data center.

If every request first lands in an arbitrary data center and is then corrected by DLB, cross data center hops are inevitable and latency grows. So we also optimized together with the client: after DLB corrects a request, it returns the IP of the user's data center to the client in a response header. On the next request, the client accesses that IP directly.

If the data center the user is currently accessing goes down, the client downgrades to the original domain-name access and resolves to a surviving data center via DNS.

Once a user's request reaches the unit data center, in theory all subsequent operations complete there. As mentioned earlier, the request closes the loop in one data center as far as possible, but not in absolutely every case.

That is because some business scenarios are not suitable for unitization, such as inventory deduction. So in our division, one data center is the central data center, and businesses that are not made multi-active are deployed only there; when inventory is deducted, a cross data center call is required.

When a request is in the central data center, how does it learn the service information of the unit data center? Our registry (Nacos) performs two-way synchronization, so each side obtains the full service information of both data centers.

With registration information replicated in both directions, central services are simply called across data centers. Unit services, however, will appear with instances from multiple data centers; without control, calls could land in the wrong data center, so the RPC framework had to be transformed.

1) Define the route type

Default routing

If the central data center does not have the service, the service in the unit data center is invoked; if the unit data center does not have it either, an error is reported directly.

Unit routing

If the request lands in the unit data center, the user's traffic rule maps to the unit data center, and all subsequent RPC calls stay within it; if a required service is absent there, an error is reported.

Center routing

If the request lands in the unit data center, it directly calls the service in the central data center; if the central data center lacks the service, an error is reported. If the request is in the central data center, the local data center is called.

2) Business transformation

The business side needs to mark the route type of each of its (Java) interfaces by adding @HARoute to the interface. Once tagged, the route type is written into the interface's metadata when the Dubbo interface is registered, and it can be viewed in the Nacos console. All methods of the interface invoked via RPC are routed according to the tagged type.

If an interface is marked as a unit route, our current internal convention is that the first parameter of each method is a lowercase long buyerId, and the RPC framework uses this value to determine the user's data center when routing.

The routing logic is as follows:
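As a rough sketch of this routing decision — the enum names, the method signature, and the modulo-based user mapping are all illustrative assumptions, not the real framework code:

```java
// Illustrative sketch of the route-type decision for an RPC call.
public class HARouter {
    public enum RouteType { DEFAULT, UNIT, CENTER }
    public enum DataCenter { CENTER, UNIT }

    // Stand-in for the multi-active control center's user-to-data-center mapping.
    static DataCenter homeOf(long buyerId) {
        return buyerId % 2 == 0 ? DataCenter.CENTER : DataCenter.UNIT;
    }

    /** Pick the data center an RPC call should be sent to. */
    public static DataCenter route(RouteType type, DataCenter local,
                                   Long buyerId, boolean centerHasService) {
        switch (type) {
            case CENTER:
                return DataCenter.CENTER;        // center routing: always call the center
            case UNIT:
                return homeOf(buyerId);          // unit routing: follow the buyer's rule
            default:
                // default routing: prefer the center, fall back to the unit
                return centerHasService ? DataCenter.CENTER : DataCenter.UNIT;
        }
    }
}
```

In the real framework, the `UNIT` branch reads the first `long buyerId` method parameter mentioned above and asks the multi-active control center for that buyer's data center.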

3) Transformation process

Make a copy of the interface, named UnitApi, and add long buyerId as the first parameter. The new interface's implementation calls the old interface, and the old and new interfaces coexist.

Publish UnitApi online; at this point it carries no traffic.

Callers in other domains upgrade their API packages and switch calls from the old interface to the new UnitApi, with a switch added here for control.

After going live, calls go through UnitApi under switch control; if a problem appears, the switch can be turned off.

Take the old API offline and complete the switchover.
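The first and fourth steps can be sketched as follows; the interface names and the boolean switch are hypothetical stand-ins for the real APIs and the real toggle system:

```java
// Sketch of the UnitApi migration: copy the interface, add buyerId first,
// delegate to the old implementation, and gate the caller with a switch.
public class UnitApiMigration {
    public interface OrderApi {                  // the old interface, no buyerId
        String queryOrder(String orderNo);
    }
    public interface OrderUnitApi {              // the copy, buyerId added first
        String queryOrder(long buyerId, String orderNo);
    }

    public static class OrderUnitApiImpl implements OrderUnitApi {
        private final OrderApi legacy;
        public OrderUnitApiImpl(OrderApi legacy) { this.legacy = legacy; }
        @Override public String queryOrder(long buyerId, String orderNo) {
            return legacy.queryOrder(orderNo);   // new interface calls the old one
        }
    }

    /** Caller side: a switch decides whether to go through the new unit API,
     *  so it can be turned off if anything goes wrong after launch. */
    public static String query(boolean unitApiEnabled, OrderApi oldApi,
                               OrderUnitApi newApi, long buyerId, String orderNo) {
        return unitApiEnabled ? newApi.queryOrder(buyerId, orderNo)
                              : oldApi.queryOrder(orderNo);
    }
}
```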

4) Problems encountered

Switching other invocation scenarios to the unit interface

Besides interfaces called directly via RPC, a large portion of traffic comes through Dubbo generic invocation, which also has to be switched to UnitApi after launch; the old interface can only be taken offline once it no longer receives any requests.

Interface classification

Interfaces must be classified. Before multi-active there was no such constraint, and the methods in one Java interface could be a mixed bag. If an interface is now a unit route, the first parameter of every method in it must be buyerId, and methods for scenarios without a buyerId must be moved out.

Business-level adjustments

For example, querying an order used to require only the order number, but now it also requires buyerId for routing, so upstream callers of the interface need to be adjusted.

The request has now successfully reached the service layer; the next step is the database. We defined several database types, as follows:

Unit library

This library is a unit library: it is deployed in both data centers at the same time, each data center holds the complete data, and the data is synchronized bidirectionally.

Central library

This library is the central library and is deployed only in the central data center.

Central-unit library

This library is the central-unit library: it is deployed in both data centers at the same time; the center can read and write, while the other data center can only read. The center writes data and replicates it one-way to the other data center.

1) Proxy middleware

At present, each business team uses the sharding middleware as a client library, and versions differ across teams. During a multi-active traffic cutover, database writes must be banned to guarantee the correctness of business data; without unified middleware, that would be very troublesome.

So we developed the database proxy middleware Rainbow Bridge by deeply customizing ShardingSphere. Each business team accesses Rainbow Bridge to replace the previous sharding approach.

2) Distributed ID

For a unitized library, the data plane replicates bidirectionally. If tables use plain auto-increment IDs, conflicts like the following occur:

This can be solved by configuring different auto-increment sequences per data center, for example odd IDs in the central data center and even IDs in the unit data center. But that is fragile: more data centers may be added later. We took a once-and-for-all approach and adopted globally unique distributed IDs to avoid primary key collisions.
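One common shape for such an ID is a snowflake-style layout that embeds a data center ID, so IDs generated in different data centers can never collide. This is only a sketch under assumed bit widths and epoch; our actual ID service is not shown, and a production generator also needs clock-rollback handling:

```java
// Snowflake-style sketch: | 41-bit timestamp | 5-bit data center | 12-bit sequence |
public class DistributedId {
    private static final long EPOCH = 1600000000000L; // assumed custom epoch
    private final long dataCenterId;                  // e.g. 0 = center, 1 = unit
    private long lastMillis = -1L;
    private long sequence = 0L;

    public DistributedId(long dataCenterId) { this.dataCenterId = dataCenterId; }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            sequence = (sequence + 1) & 0xFFF;        // 12-bit sequence per millisecond
            if (sequence == 0) {                      // exhausted: spin to the next ms
                while (now <= lastMillis) now = System.currentTimeMillis();
            }
        } else {
            sequence = 0;
        }
        lastMillis = now;
        return ((now - EPOCH) << 17) | (dataCenterId << 12) | sequence;
    }

    /** Recover which data center generated an id (useful for debugging). */
    public static long dataCenterOf(long id) { return (id >> 12) & 0x1F; }
}
```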

Client access

At present there are two ways to obtain distributed IDs. One is a jar provided by the infrastructure team and embedded in the application; the logic is as follows:

Rainbow Bridge access

The other is to configure ID generation for specific tables in Rainbow Bridge, which supports integration with the distributed ID service.

3) Business transformation

Write requests to a unitized library must carry the ShardingKey

When the DAO layer operates on a table, the ShardingKey for the current method is set via ThreadLocal, and a MyBatis interceptor then injects it into the SQL as a hint so that it reaches Rainbow Bridge. Rainbow Bridge checks whether the ShardingKey belongs to the current data center; if not, the write is rejected with an error.
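A minimal sketch of the ThreadLocal-plus-hint idea, with the MyBatis interceptor plumbing omitted; the hint comment format here is an assumption, not Rainbow Bridge's real protocol:

```java
// ThreadLocal ShardingKey holder; addHint() shows what an SQL interceptor
// would do before the statement is sent to the database proxy.
public class ShardingKeyContext {
    private static final ThreadLocal<Long> KEY = new ThreadLocal<>();

    public static void set(long buyerId) { KEY.set(buyerId); }
    public static void clear() { KEY.remove(); }

    /** Prepend the key as a hint comment so the proxy can see it. */
    public static String addHint(String sql) {
        Long buyerId = KEY.get();
        if (buyerId == null) return sql;   // no key set: pass through unchanged
        return "/* shardingKey:" + buyerId + " */ " + sql;
    }
}
```

On the proxy side, Rainbow Bridge parses the hint, asks which data center the key belongs to, and rejects the write if it is not the local one.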

A brief explanation of why writes are banned during a cutover; it is actually similar to JVM garbage collection pauses. If writes were not banned, data would keep being generated during the cutover, and we must ensure the current data center's data is fully synchronized before the new traffic rules take effect; otherwise the user switches to the other data center before the data arrives, and business errors follow. Besides Rainbow Bridge's write ban, the RPC framework also blocks requests according to the traffic rules.

Database connections must specify a connection mode

There are two connection modes: center and unit.

If an application's data source specifies the center connection mode, the data source initializes normally in the central data center and is not initialized in the unit data center.

If the application's data source specifies the unit connection mode, the data source initializes normally in both the central and the unit data centers.

Why do we need this connection mode design?

In our project, an application may connect to two libraries at once: one unit library and one central library. Without connection modes, since the same code is deployed in both the central and the unit data centers, both deployments would create both data sources.

But in fact the central library only needs to be connected in the central data center: all operations on it go through central interfaces, whose traffic is forced to the center, so connecting from the unit data center is pointless. A second problem is that we do not want to maintain the central library's connection details in the unit data center; without connection modes, the unit data center's Rainbow Bridge would also have to hold the central library's information, because the project would try to connect.
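The rule itself is simple enough to state in a few lines; the enum names here are ours for illustration, not the middleware's real configuration keys:

```java
// Whether a configured data source should be initialized in the data center
// the application instance is currently running in.
public class ConnectionMode {
    public enum Mode { CENTER, UNIT }
    public enum DataCenter { CENTER, UNIT }

    /** CENTER-mode sources exist only in the central data center;
     *  UNIT-mode sources are initialized everywhere. */
    public static boolean shouldInit(Mode mode, DataCenter runningIn) {
        return mode == Mode.UNIT || runningIn == DataCenter.CENTER;
    }
}
```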

4) Problems encountered

The central database cannot be accessed from a unit interface

If an interface is marked as a unit interface, it may only operate on the unit library. Before the multi-active transformation there was essentially no notion of center versus unit, and all tables lived together; after the transformation, we split the databases by business scenario.

After the split, the central library is used only by programs in the central data center, and connecting to it from the unit data center is not allowed. So if a unit interface involves operations on the central library, it will certainly fail; such code must be changed to call a central RPC interface instead.

The central interface cannot access the unit database

The same problem in reverse: a central interface cannot operate on the unit library. Requests to central interfaces are forced to the central data center; if the logic involves data in the other data center, it must also go through an RPC interface for correct routing, because the central data center cannot operate the other data center's database.

Batch query tuning

For example, consider a batch query by order numbers where the orders belong to different buyers. If you simply use one order's buyer as the routing parameter, some of the other orders may actually belong to another unit, so stale data may be returned.

Such a batch query can only be served directly when all orders belong to the same buyer; for different buyers, the query must be split into per-buyer batch calls.
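The splitting step can be sketched as grouping the requested orders by buyer, so each group becomes one batch call that carries a single buyerId as its routing parameter. The Order class here is a hypothetical stand-in for the real request item:

```java
import java.util.*;
import java.util.stream.Collectors;

// Split a mixed-buyer batch query into per-buyer batches for routing.
public class BatchByBuyer {
    public static class Order {
        public final long buyerId;
        public final String orderNo;
        public Order(long buyerId, String orderNo) {
            this.buyerId = buyerId;
            this.orderNo = orderNo;
        }
    }

    /** Group order numbers by buyer; each group becomes one routed batch call. */
    public static Map<Long, List<String>> split(List<Order> orders) {
        return orders.stream().collect(Collectors.groupingBy(
                o -> o.buyerId,
                Collectors.mapping(o -> o.orderNo, Collectors.toList())));
    }
}
```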

Redis is used heavily in our business, and many places need adjustment in the multi-active transformation. For Redis, let's first clarify a few definitions:

No bidirectional synchronization

Unlike the database, Redis is not synchronized bidirectionally: there is one Redis cluster in the central data center and one in the unit data center. Each data center's cluster holds only the cache data of its own subset of users, not all of it.

Redis types

Redis is divided into center and unit types: center Redis is deployed only in the central data center, while unit Redis is deployed in both the central and the unit data centers.

1) Business transformation

Redis multi-data-source support

Before the multi-active transformation, each application had a single Redis cluster. Afterward, because applications are split into unit and central parts, a single application may need to connect to two Redis clusters: one central and one unit.

The Redis package provided by the infrastructure team has to support creating multiple data sources and define a common configuration format; the business side only specifies the cluster and connection mode in its own configuration to complete access. The connection mode here works the same way as for databases.

The concrete Redis instance information is maintained in the configuration center, so the business side does not need to care about it and needs no changes when a data center is added. The configuration is as follows:

At the same time, when using Redis we must specify the corresponding data source, as follows:
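The usage pattern can be sketched as a small registry that business code queries by data source name (e.g. "center" or "unit"). The RedisClient interface and the names here are illustrative, not the real infrastructure package:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal multi-data-source registry: business code asks for a named source
// and gets back the corresponding client.
public class RedisDataSources {
    public interface RedisClient { String get(String key); }

    private final Map<String, RedisClient> sources = new HashMap<>();

    public void register(String name, RedisClient client) {
        sources.put(name, client);
    }

    public RedisClient get(String name) {
        RedisClient c = sources.get(name);
        if (c == null) throw new IllegalArgumentException("unknown redis data source: " + name);
        return c;
    }
}
```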

Data consistency

In database caching scenarios, because Redis is not synchronized bidirectionally, data can become inconsistent. For example, a user starts in the central data center, which caches one copy of the data; after switching to the unit data center, that data center caches another copy; when the user switches back to the central data center, its cache now holds the old data rather than the latest.

Therefore, when the underlying data changes, we need to invalidate the cache to ensure the eventual consistency of the data. Relying solely on the cache expiration time for consistency is not an appropriate approach.

Our solution is to subscribe to the database binlog and use it to invalidate the cache: by subscribing to the local data center's binlog, or to the other data centers' binlogs as well, the caches in all data centers can be invalidated.

2) Problems encountered

Serialization protocol compatibility

After adopting the new Redis client package, the test environment ran into compatibility issues with old data. Most applications were fine, but some, although they used the unified base package, had customized the serialization format, so data assembled by the new client no longer matched the custom protocol. We transformed this part as well, adding per-data-source protocol customization.

Use of distributed locks

Distributed locks in the project are currently implemented on Redis, and once Redis has multiple data sources, the lock must be adapted too. Call sites must distinguish the scenario; by default, the center Redis is used for locking.

However, the operations inside unit interfaces are all buyer-scoped, so those call sites were adjusted to lock on the unit Redis lock object, which also improves performance. Other scenarios that lock global resources use the center Redis lock object.
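The lock-source choice boils down to a scope decision; LockScope and the source names are our illustrative labels, not the real lock component's API:

```java
// Choose the Redis data source for a distributed lock by scenario scope.
public class LockRouting {
    public enum LockScope { BUYER, GLOBAL }

    /** Buyer-scoped locks stay in the unit Redis (faster, no cross-DC hop);
     *  global-resource locks must go to the single center Redis. */
    public static String lockSource(LockScope scope) {
        return scope == LockScope.BUYER ? "unit" : "center";
    }

    /** Keys still embed the scope and resource to avoid collisions. */
    public static String lockKey(LockScope scope, String resource) {
        return "lock:" + scope.name().toLowerCase() + ":" + resource;
    }
}
```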

After a request reaches the service layer and interacts with the database and cache, the next step is to send out a message that other businesses listen to for their own processing.

If a message produced in the unit data center is sent to the unit data center's MQ and consumed by programs there, that is fine. But what if a program in the central data center needs to consume it? So MQ, like the database, needs synchronization: messages are replicated to the other data center's MQ, and whether consumers there actually consume them is decided by the business scenario.

1) Define the type of consumption

Center subscription

Center subscription means the message is consumed only in the central data center, whether it was produced in the central or the unit data center. If it was produced in the unit data center, a copy of the message is replicated to the center for consumption.

Normal subscription

Normal subscription is the default behavior and means consuming nearest: messages sent in the central data center are consumed by central consumers, and messages sent in the unit data center are consumed by unit consumers.

Unit subscription

Unit subscription means messages are filtered by ShardingKey. No matter which data center the message is sent in, it is replicated to the other one, so both data centers have it. The ShardingKey determines which data center should consume each message: matching consumers process it, and for non-matching ones the framework automatically ACKs it.

Full unit subscription

Full-unit subscription means the message is consumed in all data centers, no matter where it was sent.
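The unit-subscription filter can be sketched as follows; the modulo user mapping stands in for the real multi-active control-center lookup, and the enum names are illustrative:

```java
// Unit subscription: both data centers hold the message; a consumer handles
// it only if the ShardingKey routes to the local data center, otherwise the
// framework ACKs it without invoking business logic.
public class UnitSubscription {
    public enum DataCenter { CENTER, UNIT }

    static DataCenter homeOf(long shardingKey) {   // assumed mapping
        return shardingKey % 2 == 0 ? DataCenter.CENTER : DataCenter.UNIT;
    }

    /** true = hand the message to business logic; false = auto-ACK and skip. */
    public static boolean shouldConsume(long shardingKey, DataCenter local) {
        return homeOf(shardingKey) == local;
    }
}
```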

2) Business transformation

Message producer adjustments

Producers need to be distinguished by business scenario. For buyer-scoped business messages, the buyerId must be put into the message at send time; what happens at consumption is decided by the consumer. If the consumer uses unit subscription, it must rely on the producer's buyerId; otherwise there is no way to know in which data center the message should be consumed.

Message consumers specify a consumption mode

As mentioned, there are center, unit, normal, and full-unit subscription modes; which to choose depends on the business scenario, and it is specified in the MQ configuration.

For example, a center subscription suits a service that is entirely central and not deployed in other data centers. A full-unit subscription suits something like cache clearing: once the data changes, the caches in all data centers are cleared.

3) Problems encountered

Message idempotency

Strictly speaking this has little to do with multi-active: message consumption must be idempotent in any case, because messages have a retry mechanism. It is called out separately because in a multi-active scenario, besides the retries of the message itself, during a traffic cutover the messages belonging to the users being moved are replicated to the other data center for re-consumption; the replay restarts from a point in time, so messages that were already consumed may be consumed again. This must be handled.

Let's explain why message consumption can fail during a cutover and why the messages need to be replicated to the other data center, as shown in the following figure:

A user performs a business operation in the current data center, producing a message. Since it is a unit subscription, the message is consumed in the current data center. During consumption, a cutover happens. The consumption logic reads and writes the database, and the unit-table operation carries the ShardingKey; Rainbow Bridge checks whether the ShardingKey matches the current rules, finds it does not, and rejects the write with an error. All messages of this batch of cut-over users therefore fail. After the traffic moves to the other data center, if the messages were not redelivered, this portion would be lost; that is why they are replicated to the other data center for redelivery.

Message ordering in the cutover scenario

As mentioned above, during a cutover messages are replicated to the other data center for re-consumption and replayed from a point in time. If your business uses an ordinary topic and several messages of the same scenario are replayed, they are not necessarily consumed in the original order, so a consumption ordering problem arises.

If your business scenario already used ordered messages, there is no problem; if not, there may be. Let me illustrate:

There is a business scenario in which triggering a function generates messages at the user level, that is, one user generates N messages. The consumer stores them, but not one row per message; each user gets only one row. The message carries a status, and updates are decided based on that status.

For example, three messages are delivered below; consumed in the normal order, the final result is status=valid.

If the messages are redelivered in the other data center, the consumption order becomes the following, and the final result is status=invalid.

There are several solutions:

a. Replace the topic with ordered messages, partitioned by user, so that each user's messages are consumed strictly in the order they were sent.

b. Make consumption idempotent: a message that has already been consumed is not consumed again. Unlike the ordinary case, there are N messages per user; storing every msgId would let you detect duplicates, but the storage pressure would be too great. You could store only the last N to reduce it.

c. An optimized idempotency scheme: every message the producer sends carries a version, and the version must be strictly increasing. After consuming a message, the consumer stores the current version; before consuming, it checks that the message's version is greater than the stored one, and only consumes if so. This avoids the storage pressure and meets the business need.
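Option c can be sketched as follows. The in-memory maps stand in for the real version store (which would be a database or Redis), and a real implementation would need the check-and-set to be atomic per user:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Version-based idempotent consumption: drop any message whose version is
// not newer than the last one applied, so replays cannot roll state back.
public class VersionedConsumer {
    private final Map<Long, Long> lastVersion = new ConcurrentHashMap<>();
    private final Map<Long, String> state = new ConcurrentHashMap<>();

    /** Returns true if the message was applied, false if dropped as stale. */
    public boolean consume(long buyerId, long version, String status) {
        Long seen = lastVersion.get(buyerId);
        if (seen != null && version <= seen) return false; // stale replay: skip
        lastVersion.put(buyerId, version);
        state.put(buyerId, status);
        return true;
    }

    public String statusOf(long buyerId) { return state.get(buyerId); }
}
```

With this in place, replaying the three messages above out of order leaves the row at the highest-version status instead of whichever message happened to arrive last.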

Job is not used much on our side; only a few legacy early-morning statistics tasks remain, and new scheduled work is connected to our self-developed TOC (Timeout Center).

Business transformation: execute in the central data center

Since Job is a legacy system with only a single-digit number of tasks running, we did not add multi-active support at the framework level; the Job logic will later be migrated to TOC.

So we retrofit at the business level to support multi-active. There are two kinds of retrofit solutions, introduced separately:

Run the Job in both data centers at once. When processing data, for example per-user data, use the capability provided by the infrastructure to check whether a user belongs to the current data center: if so, process the record; otherwise skip it.

Start from the business scenario. The Job runs in the early morning, is not online traffic, and its data consistency requirements are not that high; even without unit-aware processing there is no real problem. So we run the Job only in the central data center, and in the other data center the Job tasks are configured not to take effect.

But this approach requires auditing the data operations inside the Job. Operations on the central library are fine, since the Job itself runs in the central data center; operations on the unit library must be changed to go through RPC interfaces.
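The first option's skip-foreign-users step can be sketched like this; the modulo mapping again stands in for the real infrastructure lookup of a user's home data center:

```java
import java.util.List;
import java.util.stream.Collectors;

// Option one: the job runs in both data centers, but each instance only
// processes the users that belong to its own data center.
public class UnitAwareJob {
    public enum DataCenter { CENTER, UNIT }

    static DataCenter homeOf(long userId) {   // assumed mapping
        return userId % 2 == 0 ? DataCenter.CENTER : DataCenter.UNIT;
    }

    /** Keep only the users this data center's job instance should handle. */
    public static List<Long> ownedBy(List<Long> userIds, DataCenter local) {
        return userIds.stream()
                .filter(id -> homeOf(id) == local)
                .collect(Collectors.toList());
    }
}
```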

TOC is the timeout center we use internally: whenever we need a business action triggered at a certain point in time, we can integrate with the timeout center to handle it.

For example, an order that is not paid within N minutes of creation is automatically cancelled. If the business side implemented this itself, it would either scan tables periodically or use MQ's delayed messages. With TOC, after the order is created we register a timeout task with TOC, specifying the point in time at which we want to be called back. In the callback logic, we check whether the order has been paid and cancel it if not.

Business Transformation: Task Registration Adjustment

When registering a timeout center task, the business side needs to identify whether the task qualifies as unit-scoped. If the task only operates the central database, it can be called back in the central data center. If the task operates a unit database, the buyerId must be specified at registration, and the timeout center routes the callback to the user's data center by buyerId when it fires.

At present the timeout center is deployed only in the central data center, so all tasks are scheduled there. If no buyerId is specified at registration, the timeout center does not know which data center to call back and defaults to the central data center. For callbacks to follow the multi-active routing rules, the buyerId must be specified at registration.
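The registration rule can be sketched as a validation step; TimeoutTask and its fields are hypothetical illustrations, not the real TOC API:

```java
// A unit-scoped timeout task registered without a buyerId would be called
// back in the wrong data center, so reject it at registration time.
public class TocRegistration {
    public static class TimeoutTask {
        public final String bizType;
        public final Long buyerId;       // null means "no routing key"
        public final long fireAtMillis;
        public TimeoutTask(String bizType, Long buyerId, long fireAtMillis) {
            this.bizType = bizType;
            this.buyerId = buyerId;
            this.fireAtMillis = fireAtMillis;
        }
    }

    /** Center-scoped tasks may omit buyerId; unit-scoped tasks must carry it. */
    public static boolean validate(TimeoutTask task, boolean unitScoped) {
        return !unitScoped || task.buyerId != null;
    }
}
```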

III. Service Division

Having read through the transformation work above, you may still have a question: how should my own service be classified? Does it need to be unitized?

First, sort services out according to the overall goal and direction of the multi-active program. For example, our overall direction is that the buyer's core transaction link must be unitized, so everything upstream and downstream of that entire link needs to be transformed.

The user browses products, enters order confirmation, places the order, pays, and queries order information. This core link actually involves many business domains, such as products, bids, orders, payments, merchants, and so on.

Beneath these obvious business domains there may be other supporting domains, so the whole link must be sorted out and transformed together. Of course, not everything must be unitized; it depends on the business scenario. Inventory, for example, is definitely on the core transaction link, but it does not need to be transformed: it must go to the center.

1) Center service

A center service is deployed only in the central data center, and its database must be a central library. The entire application can be marked as center, so that external access to the service's interfaces is routed to the central data center.

2) Unit service

A unit service is deployed in both the central and the unit data centers, and its database must be a unit library. Unit services are buyer-dimension businesses, such as confirming and placing orders.

For buyer-dimension business, the first parameter in every interface definition must be buyerId, because of routing. The user's request has already been steered to a data center by the rules, and only the database in that data center is operated.

3) Center-unit service

A center-unit service has both central interfaces and unit interfaces, along with two sets of databases. Such a service is deployed in both data centers, but the unit data center only receives traffic for the unit interfaces; the central interfaces get no traffic there.

Some underlying supporting businesses, such as goods and merchants, are center-unit services. Supporting-dimension business has no buyerId; goods are generic and do not belong to any particular buyer.

The underlying database of a supporting business is a center-unit library, meaning center write, unit read. Write requests, such as creating or modifying goods, go to the central computer room and are then synchronized to the database of the other computer room. The benefit is reduced latency on the core link: if goods were not deployed in the unit, browsing goods or querying goods information when placing an order would have to cross to the central computer room for every read. With center-unit deployment, interface calls are routed to the nearest computer room; a request landing in the unit computer room calls the unit copy of the service and reads the unit database, with no cross-room traffic.
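The "center write, unit read" rule above can be sketched in a few lines. This is an illustrative routing helper under assumed names, not the real data-access layer: mutations always target the central room, while reads stay in whichever room the request landed in and rely on replication for freshness.

```python
# Sketch of center-write / unit-read routing for a center-unit library.
# CENTER_ROOM and the operation names are assumptions for illustration.
CENTER_ROOM = "A"

def choose_db(operation: str, local_room: str) -> str:
    """Pick the database room for a goods operation."""
    if operation in {"create", "update", "delete"}:
        return CENTER_ROOM   # all writes are routed to the center library
    return local_room        # reads stay local, so no cross-room hop
```

The trade-off is eventual consistency on reads in the unit room, bounded by the replication lag between the two libraries.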

In the long run it is still worth splitting center business and unit business into separate applications; that makes things clearer. It also helps new developers when defining interfaces or operating on databases and caches, because with everything mixed together you must always know whether the business behind a given interface belongs to the unit or the center.

Splitting is not absolute; as always, it depends on the business scenario. For orders, where buyer and seller business coexist, I think splitting makes later maintenance easier. But goods have no such dual role, so keeping all goods operations in one project is also convenient to maintain; you just classify the interfaces, marking create, modify, and delete as center interfaces.

Fourth, the traffic-switching scheme

As mentioned earlier, during a traffic switch writes are forbidden and MQ messages are copied to the other computer room for re-consumption. Next, we introduce our traffic-switching scheme, which should deepen your understanding of how the whole multi-live setup handles these exceptional scenarios.

1) Issue the no-write rules

When a traffic switch is required, the operator works through the console of the active-active control center. Before switching, the in-flight traffic must first be drained, so no-write rules are issued. The no-write rules are pushed to the configuration centers of both the central and the unit computer rooms, which in turn notify the programs that subscribe to them.
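The shape of such a no-write rule might look like the following sketch. Every field name here is an assumption for illustration; the real schema pushed through the configuration center is not documented in this article.

```python
# Illustrative no-write rule as it might be pushed via the config center.
# Field names (scope, buckets, forbid_write, issued_at) are assumptions.
no_write_rule = {
    "scope": "buyer",                     # keyed by the routing dimension
    "buckets": list(range(50, 100)),      # sharding-key buckets being moved
    "forbid_write": True,
    "issued_at": "2022-11-01T02:00:00Z",
}

def is_write_forbidden(rule: dict, bucket: int) -> bool:
    """Check whether writes for a given sharding bucket are frozen."""
    return rule["forbid_write"] and bucket in set(rule["buckets"])
```

Scoping the freeze to the buckets being migrated, rather than all traffic, keeps the write outage limited to the users actually affected by the switch.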

2) Rainbow Bridge applies the no-write logic

Rainbow Bridge consumes the no-write rules: as soon as a rule changes in the configuration center, Rainbow Bridge perceives it immediately. It then checks the sharding key carried in each SQL statement against the rule to determine whether that sharding key belongs to the current computer room; if not, the statement is intercepted.
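A minimal sketch of that interception check follows. The real Rainbow Bridge is a database proxy that parses the sharding key out of the SQL itself; here, for simplicity, the key is passed in explicitly and all names are illustrative.

```python
# Simplified no-write check in the style of a database proxy.
class NoWriteError(Exception):
    """Raised when a write is rejected during the traffic switch."""

def execute_sql(sql, sharding_key, local_room, room_of_key, frozen_keys):
    """Reject writes whose sharding key is frozen or owned by another room."""
    is_write = sql.strip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}
    if is_write:
        if sharding_key in frozen_keys:
            raise NoWriteError(f"key {sharding_key} is frozen during cut-flow")
        if room_of_key(sharding_key) != local_room:
            raise NoWriteError(f"key {sharding_key} belongs to another room")
    return "executed"
```

Note that reads are deliberately let through; only writes are blocked, which is what allows the switch to drain safely while users can still browse.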

3) Feed back the result of the no-write rules taking effect

After the configuration change is pushed to Rainbow Bridge, the configuration center perceives the push result and feeds the effective result back to the active-active control center.

4) Push the no-write effective time points to Otter

After receiving all the feedback, the active-active control center sends Otter an MQ message carrying all the effective time points.

5) Otter performs data synchronization

Upon receiving the message, Otter synchronizes data up to those time points.
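The key idea in steps 4 to 6 is that replication is gated on the no-write effective time: once writes are frozen, everything committed before that point must reach the other room before traffic moves. A toy sketch of that gate, with assumed names and a simplified change-stream shape:

```python
# Sketch of time-point-gated replication in the spirit of Otter.
# A change is a (commit_ts, row) pair; only changes at or before the
# effective timestamp are replicated before the switch proceeds.
def sync_until(changes, effective_ts):
    """Return the rows that must be replicated before switching traffic."""
    return [row for ts, row in changes if ts <= effective_ts]

def is_caught_up(changes, effective_ts, replicated_count):
    """True once every pre-freeze change has been applied remotely."""
    return replicated_count >= len(sync_until(changes, effective_ts))
```

Because writes are already forbidden, no change can appear after the effective time point for the frozen buckets, so "caught up" is a stable condition rather than a moving target.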

6) Otter feeds back the synchronization result

Once all data changes before the effective time point have been synchronized, Otter feeds the result back to the active-active control center via an MQ message.

7) Issue the latest traffic rules

After the active-active control center receives Otter’s completion message, it issues the new traffic rules, which are sent to DLB, RPC, and Rainbow Bridge. Subsequent user requests are then routed directly to the correct computer room.
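The final fan-out can be pictured as delivering one rule to several routing layers so they all agree on bucket ownership. This is an illustrative sketch under assumed names; the actual push mechanism between the control center and DLB/RPC/Rainbow Bridge is not described in the article.

```python
# Illustrative fan-out of the final traffic rule to every routing layer.
# Subscriber names and the rule shape are assumptions for illustration.
def issue_traffic_rules(rule, subscribers):
    """Push the same rule to each (name, apply_fn) subscriber; return acks."""
    acks = []
    for name, apply_fn in subscribers:
        apply_fn(rule)        # each layer installs the identical rule
        acks.append(name)
    return acks
```

Pushing the identical rule everywhere matters: if the load balancer, RPC framework, and database proxy disagreed on bucket ownership even briefly, a request could be routed to one room but write to the other.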

V. Summary

I believe that after reading this article you should have some understanding of the multi-live transformation. Of course, this article does not cover every multi-live-related change, because the scope of the whole transformation is too large. It mainly covers transformation points and processes at the middleware and business levels; other points are not mentioned, such as building the computer-room network, making the release system support multiple computer rooms, full-link monitoring across computer rooms, and monitoring of data verification.

Multi-live is a high-availability disaster-recovery measure, but the cost of implementation and the demands on the technical team are very high. When building multi-live, the design should be driven by business scenarios; not every system and every function must meet multi-live conditions. There is no 100% availability, only trade-offs made for the business in extreme scenarios, with priority given to keeping core functions working.

These are some of our experiences in the multi-live transformation, and I hope sharing them is helpful to you.