Today’s sharing I will mainly include these four parts:

The first part first introduces the background of horizontal expansion, mainly introduces why horizontal expansion, mainly compares with vertical expansion, and talks about the problems that we generally encounter in horizontal expansion.

The second part will briefly introduce how TDSQL does horizontal expansion, so that everyone can have an intuitive impression.

The third part will introduce the design principle behind TDSQL horizontal expansion in detail, mainly corresponding to the first part, to see how TDSQL solves the problems encountered in general horizontal expansion.

Part IV presents examples in practice.

First, let’s look at the background of horizontal scaling. The reason for the expansion is actually very intuitive, generally speaking, it is mainly with the traffic of the service, or the scale of the need to expand, and the existing capacity or performance can not meet the needs of the service, mainly in the TPS, QPS is not enough or the delay exceeds the tolerance range of the service, or the existing capacity can not meet the requirements, the latter mainly refers to the disk or network bandwidth. Generally encounter this kind of problem, we have to expand. In terms of expansion, in fact, the more common way is two ways, one is vertical expansion, and the other is horizontal expansion. These two have different characteristics, and the advantages and disadvantages are actually very obvious.

First, let’s look at vertical scaling. Vertical expansion is mainly to improve the configuration of the machine or improve the configuration of the instance. Because, we know that everyone buys a database or an instance on the cloud, in fact, it is allocated on demand, that is, for users, the current business volume may not be large, only need two CPUs or a few gigabytes of memory; As the business grows, he may need to expand the capacity of this instance, then he may currently need 20 CPUs, or 40G of memory.

At this time, in the cloud, we can dynamically adjust the control of resources to meet the needs of the business – that is, we can dynamically increase the CPU on the same machine. The limit of this expansion is that when the entire machine’s CPU and memory are given to it, if it is found that it is not enough, it is necessary to prepare a better machine for expansion. This can be switched between master and standby in MySQL: by first selecting a standby, and then synchronizing data; After the data synchronization is completed, the main and standby switches are carried out, so that the machine that is better now can be used.

As you can see, in this whole process, there is basically no impact on the business – the main and standby switching, if you change the IP, in fact, through the front-end or VIP way, basically no impact on the business. Then one of its biggest disadvantages is that it relies on stand-alone resources: you can provide it with a better machine to meet a certain amount of requirements. As the business develops more rapidly, you will find that the best machines you can provide now may not be satisfied, which is equivalent to being unable to expand. Therefore, the biggest disadvantage of vertical expansion is that it relies on the resources of a single machine.

Compared with vertical expansion, another way we call horizontal expansion. The biggest advantage of horizontal expansion is that it solves the problem of vertical expansion – in theory, horizontal expansion can be infinitely expanded, and it can dynamically adapt to the needs of the service by adding machines.

Horizontal expansion can solve the problem of vertical expansion compared with vertical expansion, but it will introduce some other problems. Because horizontal expansion is more complex than vertical expansion, let’s analyze the problems we may encounter below, and later we will introduce TDSQL solutions:

First of all, in the vertical expansion, after the system is expanded, in fact, there is still a node in the data, and in a main and multi-standby architecture, all data is also stored on the standby. The data will be split during horizontal expansion, and the first question is how to split the data? Because if the split is not good, when there is hot data, the possible result is that even if the data has been split into many parts, the separate node that stores the hot data will become a performance bottleneck.

Second, in the entire horizontal expansion process, it will involve data relocation and routing changes. So can you be unaware of the business in the whole process? Or is it about how intrusive it is to the business?

Third, in the whole expansion process, because there are so many steps just now, if one of them fails, how can you roll back? At the same time, in the whole expansion process, how can we ensure high data consistency during the switchover process?

Moreover, after expansion, because the data is split to each node, how can the performance after expansion be guaranteed? Because in theory, we want me to increase the performance linearly as the machine increases, which is ideal. In fact, in the whole process of horizontal expansion, different architectures or different ways have a relatively large impact on performance. Sometimes you will find that there may be a lot of expansion, the machine has increased, but the performance is difficult to scale linearly.

Similarly, when the data has been split into multiple parts, how can we continue to ensure the distributed nature of the database? In a stand-alone architecture, a copy of the data is stored, similar to MySQL’s support for local atomicity—guaranteeing that the data in one thing will either all succeed or all fail. In a distributed architecture, atomicity can only guarantee that the data is consistent at a single point. Therefore, from a global perspective, since the data is now cross-node, how to ensure global consistency in the cross-node process, and how to ensure that the data is either written successfully or rolled back on multiple nodes? This will involve distributed transactions.

So as you can see, the advantages of horizontal expansion are obvious, it solves the limitation of vertical expansion machines. But it is more complex and introduces more problems. With these questions in mind, I’ll introduce how TDSQL scales horizontally and how it solves the problems just mentioned.

First, let’s look at the architecture of TDSQL. TDSQL simply contains several parts:

The first part is the SQL engine layer: mainly as an access side, masking the data storage details of the entire TDSQL backend. For the business, the business accesses the SQL engine layer.

This is followed by a data storage layer consisting of multiple SETs: in a distributed database, where data is stored on individual nodes, each SET we treat as a unit of data. It can be one main and two standbys or one main and multiple standby, which is deployed according to business needs. Some business scenarios have high requirements for data security, and can be one main and three backups or one main and four backups. This is data storage.

There is also a Scheduler module, which is mainly responsible for the monitoring and control of the entire system cluster. When the system is expanded or switched between main and standby, the Scheduler module is equivalent to the control module of the brain of the entire system. For the business, it only pays attention to the SQL engine layer, does not need to pay attention to Scheduler, does not need to pay attention to how the data crosses nodes, how many nodes are divided into nodes, etc., which is unaware of the business.

You can take a look at the entire expansion process: at the beginning, the data is placed on a Set, that is, in a node. Then the expansion will actually expand the data to – there are 256 Sets, which will be expanded to 256 machines. As you can see in the whole expansion, there are several key points:

At first, although the data was on a node and on a machine, the data had actually been split, and the example shown in the illustration was that it had been split into 256 copies.

Horizontal scaling, in simple terms, is to migrate these shards to other sets, that is, other node machines, so that you can increase the number of machines to provide system performance.

To sum up, the data has been divided at the beginning, the expansion process is equivalent to moving the shard to a new node, the entire expansion process, the number of nodes is increased, from 1 to 2 to 3, and even expanded to 256, but the number of shards is unchanged. If 256 shards are initially expanded to two nodes at the beginning, it is possible that every 128 shards are on one node; Expand to the end, you can expand to 256 nodes, the data is in 256 machines, each machine is responsible for one of the shards. Therefore, the entire expansion is simply to move the shard. We’ll talk about the details later.

On the private or public cloud, there is a unified front page for the entire expansion TDSQL, which is very convenient for users in the process of use.

Let’s look at this example. Now there are two Sets in this case, that is, two nodes, each node is responsible for a part of the route, the first node is responsible for 0-31, the other name is 3, and the responsible routing information is 32-63. Now there are two nodes, if you want to expand, on the foreground page we will have a “Add Set” button, click “Add Set”, will pop up a dialog box, which by default will automatically select the previous configuration, users can customize themselves, including the current Set, how many resources are needed and memory, disk allocation, etc.

In addition, because the expansion requires route switching, we can manually select a time, which can be automatically switched, or the actual situation of the service can be judged by the service, and the route switch can be manually operated. These can be set up according to the needs of the business.

After the first step is created, it is said that the brain module will be responsible for allocating various resources, as well as initializing, and the entire logic of data synchronization. Finally, you will see that the first node, which used to be two nodes, has now become three nodes. Before the expansion, the first node was responsible for 0-31, now it is only responsible for 0-15, and the other part of the route is responsible for the new node. So the whole process, you can see, through the click on the web page can quickly add two nodes to three nodes – we can also continue to add Set, continue to expand the capacity according to the needs of the business.

Just introduced the core architecture of TDSQL, as well as the characteristics of horizontal expansion and foreground operation, (help) everyone to establish an intuitive impression.

In the third chapter, I will introduce the design principle behind TDSQL horizontal expansion in detail, mainly to see some of the problems encountered in the horizontal expansion mentioned in the first chapter, and how we solve these problems. These problems need to be solved regardless of which system is scaled horizontally.

First of all, we just mentioned that the first problem of horizontal scaling is how to split the data. Because data splitting is the first step, this will affect the entire usage process that follows. For TDSQL, the logic of data splitting is placed in a syntax for creating a table. The service needs to specify that the shardkey is “equal to a field” – the business needs to select a field as the partition key when designing the table structure, so that TDSQL will split the data according to this partition key, and if it accesses, it will aggregate the data according to the partition key. We want the business to be involved in designing the table structure by specifying a field as the shardkey. In this way, compatibility and performance can be well balanced.

In fact, we can also do not specify shardkey when the user creates the table, and randomly select a key from our bottom side to do the data splitting, but this will affect the subsequent use efficiency, such as not being able to play a particularly good role in the performance of the distributed database. We believe that a small amount of involvement at the business layer in the design of the table structure can bring a very large performance advantage, balancing compatibility and performance. In addition, if the business chooses shardkey – partition key, when the business designs the table structure, we can see multiple tables, you can choose the relevant column as the shardkey, which can ensure that when the data is split, the relevant data is placed on the same node, which can avoid cross-node data interaction in many distributed situations.

When we create tables, partitioned tables are the ones we use most often, splitting the data into individual nodes. In addition, in fact, we provide two other – a total of three types of tables will be provided, the main thinking behind it is for performance, that is, by the global table of this kind of data is a full amount of tables on each node – at the beginning we will see that the data is all on each node, it is equivalent to no distributed features, no horizontal split characteristics, but in fact, this kind of table, we will generally use in the data volume is relatively small, less change in some configuration tables, through the redundancy of data to ensure subsequent access, In particular, data interaction across nodes can be avoided as much as possible when operating. In other ways, shardkey said, we will make a hash according to the user, this advantage is that our data will be more evenly distributed across the nodes, to ensure that the data will not have hot spots.

As mentioned earlier, because the entire expansion process will be more complicated, can the entire expansion process ensure high availability or high reliability, and what is the perception of the business, and how does TDSQL do it?

Data synchronization

The first step is the data synchronization phase. Suppose we now have two Sets, and then we find that one of the SET disk capacity is now more dangerous, for example, it may reach more than 80%, at this time to expand it, we will first create a new instance, through the copy image, create a new instance, create a new synchronization relationship. The process of establishing synchronization is business-agnostic, and this process is real-time synchronization.

Data validation

The second stage is to continuously level the data and continuously check the data. This process may last for a while, for the delay difference between the two synchronizations is infinitely close – for example, we set a threshold of 5 seconds, and when we find that we have been chased within 5 seconds, this time we will enter the third stage – the route update phase.

Route updates

In the routing update stage, first of all, we will freeze the write request, at this time, if the service has written it, we will reject it, let the business retry after two seconds, this time there is actually a second-level impact on the business. But this time will be very short, after freezing the write request, the third instance synchronization will soon find that the data will catch up, and the check is no problem, this time we will modify the route, while performing related atomic operations, in the bottom layer partition shielding, so as to ensure that the SQL access layer in the case of the route is too late to update the data will not be written wrong. Because the underlying layer has changed, the partition has been shielded. This ensures data consistency. Once the route is updated, the third SET can receive the user’s request, and at this time you can find that the first SET and the third SET have full data because the synchronization is established.

Delete redundant data

The final step requires the deletion of this redundant data. Deleting redundant data is deferred to ensure that the deletion process can be slowly deleted, and it will not cause relatively large IO fluctuations and affect the business of the live network. During the whole deletion process, we do partition masking, and at the same time, we will do SQL rewriting at the SQL engine layer to ensure that although we have redundant data at the bottom, even if it is a full scan when the user comes to check, we can ensure that there will be no more data. You can see the entire expansion process, data synchronization, as well as checksum and delete redundancy these stages, the time consumption will be relatively long, because to build synchronization, if the amount of data is relatively large, the entire copy image or chase binlog This period of time is relatively long. However, these stages have no impact on the business, and the business does not perceive that a new synchronization relationship has been added to the present. Then, if a problem is found when establishing a synchronization relationship, or if there is a problem when creating a new standby, you can completely change another standby, or after a retry, this has no impact on the business. In the routing update stage, in theory, it is difficult to avoid the second-level impact on the service write request, but we will control the impact time window to a very short time, because the freeze write request itself is necessary to ensure that the synchronization has been within 5 seconds such a relatively small threshold, and after synchronization to this stage, we can initiate the route update operation. At the same time, we have partitioned masking the storage layer to ensure that there will be no data confusion between multiple modules if there is an update at the same time. This is a principle process in which we ensure high availability and high reliability in scaling, and the entire scaling has very little impact on the business.

I just talked about the general process of the expansion phase and how TDSQL solves the problem. Next, let’s look at how to solve the problems brought about by the horizontal expansion just mentioned after the expansion is completed. The first is distributed transactions.

Atomicity, decentralization, linear growth in performance

After expansion, the data is cross-node, the system originally had only one node, now cross-node, how to ensure the atomicity of the data, this we based on two-stage commit, and then realized distributed transactions. The entire processing logic completely masks the complexity behind the business, and the use of distributed databases for the business is the same as the use of stand-alone MySQL. If the SQL of the business only accesses one node, then ordinary transactions can be used; If we find that a user’s SQL or a transaction operates on multiple nodes, we commit it in two stages. In the end, the atomicity of the entire distributed transaction is guaranteed by logging. At the same time, we have completely decentralized the entire distributed transaction in the implementation process, and we can do TM through multiple SQLs, and the performance can also achieve linear growth. In addition, we have done a lot of various exception verification mechanisms, with very robust exception handling and global trial-and-error mechanisms, and we have also passed the TPCC standard verification.

For horizontal scaling, after the data is split into multiple nodes, it mainly brings two problems: one is the problem of transaction atomicity just said, which is solved by distributed transactions; And then there’s performance.

In vertical expansion, it is generally possible to achieve a linear increase in performance by replacing a better CPU or a similar method. In terms of horizontal expansion, because the data is split to multiple nodes, how to make good use of the nodes that are split down, carry out parallel computing, and truly play the advantages of the horizontal distributed database, which requires a lot of operations and a lot of optimization measures. TDSQL has made some optimizations like this.

First, the relevant data exists on the same node. When building the table structure, we hope that the business can participate in a part, and specify some relevant keys as shardkeys when designing the table structure, so that we can ensure that the relevant data of the backend is on a node. If you make a union query on this data, you do not need to cross nodes.

Similarly, we achieve performance improvement through parallel computing and streaming aggregation – we split and distribute SQL to each node in the background, and then compose in parallel with each node, and then do secondary aggregation through the SQL engine after the calculation is done, and then return it to the user. And in order to reduce the pull of data from the backend to SQL, if we reduce the pull of data, we will do some pushdown queries – push more conditions onto the DB. In addition, we have also done data redundancy to minimize cross-node data interaction through data redundancy guarantees.

Let’s simply look at an aggregation – how TDSQL achieves horizontal scaling, basically has no perception of the business, and the use method is the same as using stand-alone MySQL. For the business, assuming that there are 7 pieces of data, the business does not care whether there is one node or multiple nodes in the specific data of this table, and only needs to insert 7 pieces of data. The system will parse the syntax according to the SQL transmitted and automatically rewrite this data. 7 pieces of data, the system will calculate according to the partition key, find that these 4 are to be sent to the first node, the other 3 are sent to the second node, and then overwritten, and then inserted after the overwriting. For the user, it is to execute such a one, but across nodes, we will use two-stage commits on our side, which will become multiple SQLs, and then ensure that once there is a problem, both sides will roll back at the same time.


After the data is inserted, if the user wants to make some queries – in fact, the user does not know that the data is split, for him it is a complete table, he uses similar aggregate functions and so on to query. Similarly, this SQL will be rewritten, and the system will send this SQL to both nodes, and add some averaging functions to perform the corresponding transformations. At each node, the system will first do data aggregation, and then do aggregation again here. The advantage of adding this step is that if we come over here, we can avoid consuming too much memory at one time by making an aggregation, which is equivalent to not having to cache too much data here, and doing a streaming calculation.

For some of the more complex SQL, such as multiple tables or more subqueries, if you are interested, you can pay attention to our later sharing – SQL engine architecture and engine query practice.

In the third chapter above, we introduce some of the principles of TDSQL’s entire horizontal expansion in more detail, such as how to split the data, horizontal expansion practices, and how to solve problems in the expansion process, and also introduce some problems brought about by horizontal expansion, how TDSQL solves.

In the fourth chapter, we will briefly introduce some practices and cases.

Just now we said that we hope that when creating a table, the business can consider the choice of partition key when it participates in the table structure design. How do I choose a partition key? Here are a few types to briefly introduce them.

If it is an Internet application for users, we can use the user’s corresponding fields, such as user ID, to make the partition key. This ensures that when there are a large number of users, the data can be split into individual backend nodes according to the user ID.

Game applications, the logical body of the business is the player, we can pass the corresponding field of the player; For e-commerce applications, it can be used as a partition key according to some fields of buyers or sellers. The Internet of Things can be used by, for example, the ID of the device as the partition key. Choosing the partition key is generally to split the data well to avoid leakage points at the end. That is to say, selecting this field through this partition key can make the data more evenly dispersed to each node. In terms of access, when there are more SQL requests, it is actually with a partition key condition. Because only in this case, can we better play the advantages of distribution – if it is a partition key in the condition, then this SQL can be directly entered into a node; If there is no partition key, it means that this SQL needs to be sent to all nodes on the backend.

As you can see, if the horizontal expansion to more – from one node to 256 nodes, then a certain SQL is not written well, you may need to do the aggregation of all the data of 256 nodes, then the performance will not be very good.

To sum up, we want the business to be as involved as possible in creating tables and designing the table structure. Because whether it is an aggregate function or the operation of various transactions, in fact, the business is basically unaware, and the participation of the business at this time means that it can be exchanged for a great performance improvement.

When are we going to scale? In TDSQL, we will have a large amount of monitoring data, for each module we will monitor the operation status of the entire system locally, and there will be various log reporting information on the machine. Based on this information, we can decide when to scale.

In simple terms, such as disks – if you find that the data disk usage is too high, you can expand the capacity at this time; Or SQL requests, or CPU usage is close to 100% – now it basically needs to be expanded if it reaches 80% usage. There is also a situation that may be that the amount of requests at this time is actually relatively small, and the use of resources is relatively sufficient, but if the business tells you in advance that an activity will be carried out at some point, the amount of requests will increase several times at that time, and we can also complete the expansion in advance.

Let’s look at a few more examples of clusters on the cloud. As you can see, this cluster has 4 sets, each SET is responsible for a part of the shardkey, this routing information is 0-127, which means that it can finally expand to 128 nodes, so it can be expanded by 128 times. This “128” can be set by the initialized business estimate. Because if the pool is too large, it can indeed be expanded to several thousand units in the end, but the data will be scattered. In fact, every machine on the cloud today or in practice is already very good and does not require thousands of specifications.

This is another cluster – it will have a little more nodes, with 8 nodes, each of which is also responsible for a portion of the routing information. This number is only 64, so this can eventually be expanded to 64 nodes. This is a related example on the cloud.

Today’s sharing is mainly these contents, if you have any questions, please feel free to comment and leave a message.

Q: The tables in SET before the expansion are all partition tables, so why not partition tables?

A: Yes, before the expansion, it is equivalent to this, simply say that we have a node now, then we tell him 256, this value we set when we initialize. And this value will not change after the cluster is initialized. Suppose our cluster sets a value of 256 – because he may think that this amount of data will be very, very large later, and can be set to 256. At this time, the data is on one node. At this time, the user has created a table according to the syntax we just had, and this table is actually divided into 256 parts at the bottom. So even if he doesn’t expand it, its data is 256 copies. Create another table, which is also 256 copies. You may create two tables, but we have 256 partitions on the bottom layer of each table, and expanding the capacity is equivalent to moving the partitions to other places.

Q: How can the backup files of each node be restored to ensure consistency with each other?

A: There is no mutual relationship between the nodes, each node is responsible for a part of the routing number segment, only part of the data is stored, horizontal expansion is only responsible for a part of the data, the backup between them is actually not related to each other, so this backup is actually unrelated. Each node we may have a master and two standby, this is actually we have a strong synchronization mechanism, in the replication time to ensure strong consistency of data. You can refer to the previous sharing, which will introduce in more detail “TDSQL in a single node, TDSQL one main and multiple standby architecture how to ensure strong consistency of data.”

Q: Can the two phases avoid a single point of failure when coordinating?

A: First of all, when committing in two stages, we use the SQL engine to coordinate transactions, which is a single transaction. If other connections are sent over, you can use other SQL engines for transaction coordination. And each SQL engine is stateless and can be scaled horizontally. So this is actually not too many failures, we can randomly expand according to performance, can do linear growth of performance, no centralization. The logs are scattered, and the logs will also be recorded in the data nodes of the TDSQL backend, one main and multiple backups, internal to ensure strong consistency, there will be no single point of failure.