Basic concepts

Data centers

Central data center: the existing data center from the single-data-center setup. Besides the active-active services, the long-tail services and the services that have not been made multi-active all run here.

Unit data center: the new data center added for active-active. It carries the active-active traffic of the main business flows.

Routing

sharding_id to route_code: active-active converts a sharding_id into a route_code according to the routing rules (for four-wheel travel, the sharding_id is the region). Each route_code maps to either the central data center or the unit data center. Gateways, SOA, Redis, DB, and other components route requests to the correct data center according to the route_code.
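The two-step lookup described above can be sketched as follows. This is only an illustration: the region names, route codes, table contents, and the fallback to the central data center are all assumptions, not values from the actual system.

```python
from typing import Optional

# Hypothetical routing tables; the real rules and codes are not given in the text.
REGION_TO_ROUTE_CODE = {
    "hangzhou": "rc_01",
    "shanghai": "rc_02",
    "chengdu": "rc_03",
}
ROUTE_CODE_TO_ROOM = {
    "rc_01": "center",
    "rc_02": "unit",
    "rc_03": "center",
}
DEFAULT_ROOM = "center"  # assumption: unknown traffic falls back to the central data center


def to_route_code(sharding_id: str) -> Optional[str]:
    """Step 1: convert a sharding_id (here, a region) into a route_code."""
    return REGION_TO_ROUTE_CODE.get(sharding_id)


def resolve_room(sharding_id: str) -> str:
    """Step 2: map the route_code to the central or the unit data center."""
    route_code = to_route_code(sharding_id)
    return ROUTE_CODE_TO_ROOM.get(route_code, DEFAULT_ROOM)
```

Every component (gateway, SOA, Redis, DB) would consult the same mapping, so a given sharding_id always resolves to the same data center.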

Multi-active deployment modes

Same-city active-active

Active-active deployment (two IDCs) in the same city.

Remote active-active

Active-active deployment in two cities (one IDC per city).

Remote multi-active

Multi-IDC deployments in multiple cities.

Pros and cons

Unitization

Consider this: once a company's or a business's TPS reaches hundreds of thousands or millions, bottlenecks in system design, in architecture, and even in the data center itself become extremely prominent. Yet the combined TPS of a whole country or of the world is far more than millions or tens of millions. That works because traffic is routed to different data centers from the very beginning, by company, business, data center, and region. Companies and businesses are naturally isolated, so each one only has to handle its own share of the TPS. For example, Taobao traffic only reaches Alibaba's applications and data centers, and Didi's traffic only reaches Didi's applications and data centers. But if a single company's business reached 100 million TPS and could not scale horizontally without limit, no company could withstand that much concurrency. The limit is not only architectural: a physical data center cannot host such a large cluster either, because power and floor space are constrained. Unitization provides an architecture that is, in theory, infinitely scalable.

Unitization can be understood as the final form of remote multi-active. It splits traffic across IDCs at the traffic ingress; each IDC handles its own traffic, and IDCs do not call each other. Unitization is independent of region: in theory, once a system is unitized, new traffic can be absorbed entirely by adding a new IDC, and the new IDC is not constrained by location, because there are no calls between IDCs.

As I understand it, there is only one criterion for judging whether unitization has been achieved: whether traffic forms a closed loop inside a single data center. For example, suppose data center A is deployed in Shanghai and data center B can be placed anywhere, even far overseas, with no impact on the business. Then unitization can be considered successful. (When A and B are geographically far apart and traffic is not self-contained, the RT of calls between them grows, which inevitably affects the business.)

The unitized traffic flow is as follows:

With unitization, active-active only requires synchronization at the underlying data layer. This is shown below:

Active-active traffic routing rules

Routing modes

1. Random routing

Traffic is distributed across the IDCs purely by a configured ratio, with no other rule. In the event of a failure, the traffic of the failed data center can be switched to another IDC.

2. User id routing

Traffic is distributed across the IDCs in a given ratio by user ID, so all of a user's actions are routed to the same designated IDC. In the event of a failure, the traffic of the failed data center can be switched to another IDC user by user.

3. Regional routing

Traffic is distributed across the IDCs in a given ratio by the city the user belongs to, so user actions in each region are routed to the designated IDC. In the event of a failure, the traffic of the failed data center can be switched to another IDC region by region.
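The three modes can be compared in a minimal sketch. The IDC names, the 50/50 split, the modulo bucketing of user IDs, and the city table are assumptions made for illustration only.

```python
import random

IDCS = ["idc_a", "idc_b"]
WEIGHTS = [50, 50]  # hypothetical traffic split, in percent
CITY_TO_IDC = {"hangzhou": "idc_a", "shanghai": "idc_b"}  # hypothetical assignment


def pick_by_bucket(bucket: int) -> str:
    """Map a bucket in [0, 100) onto an IDC according to WEIGHTS."""
    upper = 0
    for idc, weight in zip(IDCS, WEIGHTS):
        upper += weight
        if bucket < upper:
            return idc
    return IDCS[-1]


def route_random(rng: random.Random) -> str:
    """Random routing: only the ratio is respected, nothing is sticky."""
    return pick_by_bucket(rng.randrange(100))


def route_by_user(user_id: int) -> str:
    """User ID routing: the same user always lands in the same IDC."""
    return pick_by_bucket(user_id % 100)


def route_by_region(city: str) -> str:
    """Regional routing: every user of a city lands in that city's IDC."""
    return CITY_TO_IDC.get(city, IDCS[0])


def fail_over(table: dict, failed: str, target: str) -> dict:
    """Return a new city table with the failed IDC's regions moved to target."""
    return {city: (target if idc == failed else idc) for city, idc in table.items()}
```

Failover for the regional mode is a table rewrite: `fail_over(CITY_TO_IDC, "idc_b", "idc_a")` moves every region of the failed IDC in one step, while the user ID mode would instead change the bucket-to-IDC weights.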

The choice made by four-wheel travel

After many rounds of discussion, Hello four-wheel travel finally chose region-based routing.

The main reasons are as follows. Random routing is a poor fit for any multi-active design: because it follows no rule, it cannot support unitization in a multi-active project. Region-based routing was chosen over user-based routing mainly because the four-wheel business differs from e-commerce. In e-commerce, the basic operations center on consumer (C-end) users, and each consumer only operates on their own order data. Order data is therefore naturally isolated by user ID, and unitization is easier to achieve. That scheme comes at the expense of the merchant (B-side) experience, though: a merchant who operates on the orders of many users will inevitably hit cross-data-center access, which degrades the experience.

In four-wheel travel, the buyer and the seller are the passenger and the driver, which is a natural dual-order model (driver order and passenger order). With user ID routing, the driver and the passenger are assigned to different data centers by their different user IDs, so any interface that touches the driver order and the passenger order at the same time (such as a driver taking an order) would inevitably produce a large number of cross-data-center calls. With region-based partitioning, trips that cross cities or provinces are extremely rare, so the cross-data-center rate drops sharply, and it can be reduced further by manual tuning based on the observed cross-data-center ratio. The scheme also has a drawback: regional order volume fluctuates considerably with time and circumstances, such as holidays and rainy days, so traffic within one city or province swings widely and the load on the data centers becomes uneven.
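The difference between the two partitioning choices can be made concrete with a toy calculation. The four trips below are invented purely to illustrate the mechanism and are not measurements; the modulo rule for user IDs and the city table are likewise assumptions.

```python
from dataclasses import dataclass

ROOMS = ["center", "unit"]
CITY_TO_ROOM = {"hangzhou": "center", "shanghai": "unit"}  # hypothetical assignment


@dataclass
class Trip:
    passenger_id: int
    driver_id: int
    passenger_city: str
    driver_city: str


def room_by_user(user_id: int) -> str:
    """User ID routing: hypothetical modulo rule."""
    return ROOMS[user_id % len(ROOMS)]


def cross_room_rate(trips, passenger_room, driver_room) -> float:
    """Share of trips whose passenger order and driver order live in different rooms."""
    crossing = sum(1 for t in trips if passenger_room(t) != driver_room(t))
    return crossing / len(trips)


# Invented sample: three same-city trips and one rare cross-city trip.
trips = [
    Trip(1, 10, "hangzhou", "hangzhou"),
    Trip(2, 12, "hangzhou", "hangzhou"),
    Trip(3, 13, "shanghai", "shanghai"),
    Trip(4, 15, "shanghai", "hangzhou"),
]

by_user = cross_room_rate(
    trips,
    lambda t: room_by_user(t.passenger_id),
    lambda t: room_by_user(t.driver_id),
)
by_region = cross_room_rate(
    trips,
    lambda t: CITY_TO_ROOM[t.passenger_city],
    lambda t: CITY_TO_ROOM[t.driver_city],
)
```

Under user ID routing, whether a trip crosses data centers depends on two unrelated IDs, so it happens regardless of geography. Under regional routing, only the cross-city trip crosses.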

Active-active solution

Middleware solutions

The capabilities provided by the active-active middleware fall into four categories: storage, messaging, SOA, and the snowflake algorithm.

Storage

The storage layer provides many capabilities, including bidirectional synchronization of the underlying data.

Redis cross-data-center reads and writes and cross-data-center locks both exist because the dual-order model cannot be unitized; they provide active-active capability for those cases.

Redis double-write is a temporary capability for cases where bidirectional synchronization is not available.

DB correction specifies the route at the DB level. It is a fallback: when SOA routing goes wrong, the DB layer acts as the last safety net (it can access cross-data-center data).

DB write-ban protection: when a service has write-ban protection turned on, orders that do not belong to the local data center cannot be modified there. This is also a DB-level safety net.
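The write-ban check can be sketched as a guard that runs before any write. The function name, the exception type, and the way the owning data center is passed in are all hypothetical; in practice the owner would presumably be derived from the order's route_code.

```python
class CrossRoomWriteError(Exception):
    """Raised when a write targets an order owned by another data center."""


def check_write_allowed(order_room: str, local_room: str, write_ban_enabled: bool) -> None:
    """Reject writes to orders that do not belong to the local data center.

    order_room: data center that owns the order (assumed to be known to the caller).
    local_room: data center this service instance runs in.
    """
    if write_ban_enabled and order_room != local_room:
        raise CrossRoomWriteError(
            f"order belongs to {order_room!r}, refusing write in {local_room!r}"
        )
```

With protection off the guard is a no-op, which matches the description of it as an optional switch per service.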

Messaging

Messages are divided into sending and consuming. 

For sending, unit-to-center replication and center-to-unit replication both copy the messages of one data center to the other. This is also a solution for cases that cannot be unitized. It also covers the case where an active-active application sends messages that a non-active-active application consumes.

Consume local messages only: only the messages produced in the local data center are consumed; messages replicated from the remote data center are not.

Consume nothing: neither local messages nor remote messages are consumed.

Consume both local and remote messages: messages from the local data center and from the remote data center are both consumed (both copies are consumed).
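The three consumption modes amount to a filter on the consumer side. The sketch below assumes each message carries a tag naming the data center that produced it; that tag and the enum names are hypothetical.

```python
from enum import Enum


class ConsumeMode(Enum):
    LOCAL_ONLY = "local_only"              # consume local messages only
    NONE = "none"                          # consume nothing
    LOCAL_AND_REMOTE = "local_and_remote"  # consume both copies


def should_consume(mode: ConsumeMode, origin_room: str, local_room: str) -> bool:
    """Decide whether a consumer in local_room handles a message from origin_room."""
    if mode is ConsumeMode.NONE:
        return False
    if mode is ConsumeMode.LOCAL_ONLY:
        return origin_room == local_room
    return True
```

A consumer in the unit data center running in `LOCAL_ONLY` mode would skip a message replicated from the center, while `LOCAL_AND_REMOTE` would process it.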

SOA

SOA interfaces need to route RPC requests to the correct data center according to specific conditions.

Service provider routing: the routing rule is specified by the service provider.

Service consumer routing: the routing rule is specified by the service consumer.

Snowflake algorithm

Because the zk clusters span two data centers, the snowflake algorithm must embed a data center identifier in each ID to guarantee global uniqueness.
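A minimal sketch of a snowflake-style generator with a data center field follows. The bit widths, the custom epoch, and the field order are assumptions chosen for illustration; the source does not describe the actual layout.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch, in milliseconds
ROOM_BITS, WORKER_BITS, SEQ_BITS = 2, 8, 12  # hypothetical field widths
WORKER_SHIFT = SEQ_BITS
ROOM_SHIFT = SEQ_BITS + WORKER_BITS
TIME_SHIFT = SEQ_BITS + WORKER_BITS + ROOM_BITS
SEQ_MASK = (1 << SEQ_BITS) - 1


class SnowflakeGenerator:
    def __init__(self, room_id: int, worker_id: int):
        if not 0 <= room_id < (1 << ROOM_BITS):
            raise ValueError("room_id out of range")
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.room_id = room_id
        self.worker_id = worker_id
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            # Never go below the last timestamp used, even if the clock steps back.
            now = max(int(time.time() * 1000), self._last_ms)
            if now == self._last_ms:
                self._seq = (self._seq + 1) & SEQ_MASK
                if self._seq == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = int(time.time() * 1000)
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << TIME_SHIFT)
                | (self.room_id << ROOM_SHIFT)
                | (self.worker_id << WORKER_SHIFT)
                | self._seq
            )


def room_of(snowflake_id: int) -> int:
    """Extract the data center identifier from an ID."""
    return (snowflake_id >> ROOM_SHIFT) & ((1 << ROOM_BITS) - 1)
```

Two generators that were handed the same worker ID in different data centers still produce disjoint IDs, because the data center field differs.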

Business-side changes

Most of the business-side changes build on the middleware solutions. The rest are active-active changes that the business teams carried out themselves.

Unitization changes: some business logic written for the single-data-center setup cannot be unitized as is, so the services that can be unitized need to be reworked.

DB and cache consistency: building on the order reliability guarantee, binlog messages are used to keep order data consistent between the two data centers.

Data center filtering: based on data center information, either process or filter out the logic that does not belong to the local data center (a temporary solution for what cannot yet be unitized).

Order number changes: when an order is placed, the routing rule is encoded into the order number, and later modifications to other attributes of the order are routed by the order number (this guarantees unitization).
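One way to picture routing by order number is to reserve part of the number for the route_code. The layout below (a two-digit route_code prefix followed by a twelve-digit sequence) is purely an assumption; the source does not say how the rule is encoded.

```python
ROUTE_CODE_DIGITS = 2  # hypothetical width of the route_code prefix
SEQUENCE_DIGITS = 12   # hypothetical width of the sequence part


def build_order_no(route_code: int, sequence: int) -> str:
    """Encode the route_code into the order number when the order is placed."""
    if not 0 <= route_code < 10 ** ROUTE_CODE_DIGITS:
        raise ValueError("route_code out of range")
    if not 0 <= sequence < 10 ** SEQUENCE_DIGITS:
        raise ValueError("sequence out of range")
    return f"{route_code:0{ROUTE_CODE_DIGITS}d}{sequence:0{SEQUENCE_DIGITS}d}"


def route_code_of(order_no: str) -> int:
    """Recover the route_code so later updates go to the owning data center."""
    return int(order_no[:ROUTE_CODE_DIGITS])
```

Because the route_code travels with the order number, an update that only knows the order number can still be routed to the data center where the order was created.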