Click on the top of “zhisheng” to follow, star or pin to grow together
Flink goes from beginner to proficient Series of articles
Many developers are familiar with MVCC (Multi-Version Concurrency Control) in InnoDB. At the application level, by maintaining multiple versions of data, the number of parallel transactions can be increased without affecting the seriality of each transaction. This paper Paper Reading paper from VLDB 2017: “An Empirical Evaluation of In-Memory Multi-Version Concurrency Control”, which combs through the different implementations of MVCC mainstream and gives a test comparison of different implementations in OLTP scenarios at the end. This paper focuses on the paper describing several key implementation keys of MVCC: concurrency control protocol, version data storage mode, garbage cleaning mechanism, and index management, and how MVCC implementations differ from InnoDB compared to other DBMS implementations.
The author of this article: jiekun, authorized forwarding
https://jiekun.dev/posts/mvcc/ The Paper Reading series aims to share papers published at prestigious conferences. VLDB (Very Large Data Bases) is one of the three major conferences in the database industry, the other two being ICDE (The International Council for Open and Distance Education) and SIGMOD.
About the paper
An Empirical Evaluation of In-Memory Multi-Version Concurrency Control” from VLDB 2017, The main contribution is to completely sort out the mainstream MVCC technical points, and implement all Approaches in the self-developed Peloton database for horizontal test comparison of fixed variables.
Interestingly, the title was not used when the paper was originally submitted. According to Andrew Pavlo (one of the authors of the paper, CMU Associate Professor), the review feedback they received when they submitted their paper to VLDB was very positive, except for a request to change the title at the time: “This is the Best Paper Ever on In-Memory Multi-Version Concurrency Control”.
But the authors still wanted to be triumphant about the title of the paper, so they resubmitted these titles twice in the following picture.
Fig 1: About The Paper
of course, these subjective titles could not be published, and in the end they had to change to the titles they see now.
(Multi-Version Concurrency Control) literally contains two aspects
Multi-versioning: generate multiple versions of data content, So that reads and writes can not block each other.
Concurrency Control: Concurrency control, which allows parallel execution to maintain serialized results.
For example, for SQL:
SELECT a FROM tab_1 where id = 1; , transactions T1 and T3 have access to different versions of the data results:
------------------------------------------------------------------------------------------------------------------+| | T1 | T2 | T3 || ---|-----------------------------------|--------------------------------------|-----------------------------------|| 1 | BEGIN; | | || 2 | | BEGIN; | || 3 | | UPDATE tab_1 SET a = 2 WHERE id = 1; | || 4 | | COMMIT; | || 5 | | | BEGIN; || 6 | SELECT a FROM tab_1 WHERE id = 1; | | SELECT a FROM tab_1 WHERE id = 1; |+------------------------------------------------------------------------------------------------------------------+
T1 is opened before T2 and T3 is opened after T2 modifies the record, so T1 reads the version before the modification, and T3 reads the version after commit.
The purpose of MVCC is to allow different transactions to run under their own snapshot version data without blocking each other: a write operation can generate a new version of the data, while a read operation can still read the old version. At present, many mainstream DBMS support MVCC, such as: MySQL, Microsoft SQL Server, Oracle, Hekaton, etc.
Although it has
been around for a long time and has a lot of database support, MVCC does not have a “standard” implementation. Different databases have their own trade-offs in MVCC design, and different designs also have performance differences in different scenarios.
The implementation of MVCC can be mainly divided into four key module designs:
we will introduce a variety of designs and implementations of different modules one by one later. But first, let’s understand what kind of metadata needs to be added to DMBS in order to support MVCC.
Version data forms
from the transaction dimension and data row (tuple) dimension, in order to support MVCC, need to introduce a specific metadata.
For transactions, the DBMS needs to provide a unique identifier to distinguish between transactions. This identifier is usually a monotonically increasing timestamp (which can be a ordinal number of 0, 1, 2, or a real time), which we call the ID of the transaction. With
TIDs, DMBS can use them to mark versions of data rows.
For data rows, the DBMS requires the following 4 fields to locate in concurrent transactions
txn-id:transaction ID, and can be used as a write lock
end-ts: represent the lifecycle
: a pointer to an adjacent (new/old) version of the same row of data, relying on the
pointer, the version data can form a singly linked list
pointer of the version data
Fig 2: MVCC Metadata Sample
MV in MVCC is implemented through these metadata of version data rows, if we simplify the model a little, as long as
there are begin-ts and end-ts recorded, we can compare them with the
TID of the current transaction
Determine which version row is visible to the transaction. The implementation of CC can actually exist independently of the MV, such as maintaining the seriality of transactions through Timestamp Ordering or Two-phase Locking.
Fig 3: MVCC Tuple Linked List
Concurrency Control Protocol
Every DBMS system needs to have a concurrency control protocol, which needs to do two things:
determine whether a particular transaction can access or modify a specific version of the data
Determining whether a particular transaction can commit its changes
Note that through
this set of protocols, MVCC solves the problem of execution order and results between different transactions (rather than multi-version data), that is, through locks or chronological order Maintain seriality. Two protocols will be introduced in this section: MVTO and MV2PL, MVOCC and Serialization Certifier can be further explored in the paper.
Timestamp Ordering (MVTO)MVTO
designed to rely on chronological order to ensure the seriality of transactions to execute the result:
write locks are still implemented by
txn-id, as long as they
are txn-idis not 0 or is not the transaction’s own TID, indicating that another transaction has made changes, and the second transaction that wants to make changes will terminate
lock through the time order guarantee, and the concurrent read scenario will not block each other, but will read the maximum
TIDof this version of the data row To record, if a future transaction reads (although it has already happened, that is, it happened in the past), then the current transaction will not be able to modify this line of data (no new version data line can be added)
key to MVTO implementation is to calculate the order of different transactions with the help of the transaction’s unique identifier (
TID, that is, timestamp).
In this implementation, the version data row introduces a
read-ts field in addition to the Metadata field mentioned above, which records the
TID of the last transaction that read it.
When transaction T reads logical data row A, DBMS needs to search for a suitable version of data according to the current TID, so that
the TID falls between begin-ts and end-ts.
At the same time, the data that transaction T can read must be
txn-id 0 or
TID, which means that the data line is not write-locked by other transactions, because the MVTO does not allow the transaction to read uncommitted changes. When reading the version data Ax of A, if the
read-ts of Ax is less than
TID, modify the field to
Fig 4: MVCC Timestamp Ordering Concurrency Control
In MVTO, the result of a transaction updating a data row always produces the latest version of that data row. Transaction T can update version Bx when the following conditions
Bxversion data is not write-locked
byother transactions Must be greater than or equal to
Bx'sread-ts, which means that when the transaction that does not read the line
of data after the TID is
updated, transaction T will add a new version of the data
Bx+1, and its
txn-id is equal to
TID 。 When transaction T commits, the begin-ts of
Bx+1 are set to TID, end-ts are set to
INF, and the
end-ts of the previous version data
Bx is set to
Two-phase locking (
MV2PL still uses
begin-ts and end-ts to decide whether a release record is visible.
Whether it is readable and writable is controlled by a read-write lock.
MV2PL uses read-cnt as a read lock, and when the corresponding version row data is found, the version data can be read locked by adding 1 to
At the same time,
read-cnt cannot be incremented when the version data holds a write lock.
Fig 5: MVCC Two-phase locking Concurrency Control
is used as a write lock, which determines whether version data can be modified:
- when the
theversion data is 0 or current
MVTO and MV2PL data row structures are very similar, and only read-ts and
read-cnt differ in the above brief.
Due to the existence of
read-ts rules, the read of transactions that start in the future will prevent the writing of transactions started earlier, so the concurrent read and write rules of MVTO are controlled in chronological order, although it has the performance of locks, but the essence is the embodiment of chronological order.
read-cnt describes the number of read locks on the current version of the data row, and does not care whether these read locks are from future transactions or previous transactions. Therefore
, read-cnt can be regarded as a shared lock, which is fundamentally different from MVTO.
Version data is stored
in MVCC, and each update creates a new version data. Pointers in version data rows are responsible for pointing to the previous version of the data or the data of the next version, thus forming a one-way chain of versions. Version chains cannot be doubly linked lists, because the modification of doubly linked lists needs to be locked and cannot be done with the help of atomic instructions.
There are many mainstream versions of data storage schemas, but it can be revealed in advance that Delta storage is the optimal solution, InnoDB uses Delta storage, Postgres uses Append-only storage, and very few DBMS use time-travel storage. We will go through them one by one.
Append-only stores different versions of all data rows (including the master version) in the same space (e.g. the same table). Whenever logical data is updated, DMBS first requests a space for a data row in the table, then copies the latest version of the data into the new data row space, and finally applies the changes to this row.
Fig 6: MVCC Append-only Storage
said earlier that because of the lock relationship, there is no way to maintain the version data of a doubly linked list, so the direction of the version data becomes very important:
if the version data is arranged from old to new (O2N), Then every time you get a newer version of the data (which is the case in most scenarios), you need to traverse the entire
is arranged from newest to oldest (N2O), then the starting point of the linked list will change every time a new version data is inserted
linked list If the version data
In the O2N scheme, due to the existence of useless version traversal, the performance of this scheme is highly dependent on the garbage collection mechanism of version data, if garbage collection can maintain the singly linked list at a short length, then the performance of the query can be improved, otherwise it will be time-consuming.
In the N2O scheme, there is also a way to reduce the number of changes to the starting point, that is, to
use a mapping entry to represent the starting point of the linked list, so that when the new version of the data is generated, only the address pointed to by the entry needs to be changed, and the index pointing to the entry can not change. This optimization improves well on tables with very many secondary indexes, but requires additional storage space.
Fig 7: MVCC O2N vs. N2O
Because Append-only stores complete data rows, even if only a few fields in the data row have changed. This behavior leads to the introduction of a large amount of duplicate and large volume of data when the table has non-inline data (that is, the data is not recorded in tuple, such as BLOB, TEXT). One of the optimization methods is to allow different versions of data rows to point to the same BLOB and TEXT objects, and add
cnt metadata to record the number of references, which is convenient for GC processing.
and Append-only store is similar, version data is a linked list record data row, but historical version data is separated from the latest version data storage space. On the main table, only the latest master version data is available, while the linked list of historical data is stored in a “Time-Travel” table.
When updating logical rows, DBMS copies the master version of the data into the “Time-Travel” table, and then updates the data rows in the main table in place to form a new version of the data.
Fig 8: MVCC Time-Travel Storage
Time-Travel is designed to avoid frequent updates of pointers on the index because they always point to the master version data on the primary table, and the data is updated in place without the address changing.
However, because version data is stored in full tuple, there will also be problems with non-inline data, and it can also be optimized by sharing objects.
store introduced in Delta also maintains only the master version of the data on the main table, and then stores the version data in an additional “Delta” space. The “Delta” space in MySQL InnoDB and Oracle refers to the data segment used for roadback, such as the undo log in InnoDB.
When updating logical data rows, DMBS also requests the location of the “delta” space first, and then writes the old version of the changed attribute data to it, rather than the full tuple row. Finally, DMBS updates the master version data in place on the master table.
Fig 9: MVCC Delta Storage Delta
stores very well when recording version data, reducing memory allocation when UPDATE only operates on a subset of data rows, and there is no problem with the large size of external data versions. However, in the heavy-read scenario, in order to obtain the corresponding version data, DMBS must obtain the corresponding version of the field values from the version data chain of each field, and then piece them together, which brings certain additional overhead.
, it can be seen that different version data storage schemes have their own adaptive scenarios. For example, Append-only storage is better in some data analysis scenarios because version data is stored in contiguous space, which can improve the cache hit rate when scanning a wide range, and can also be optimized with hardware read-ahead. In addition, the scheme of storing complete data rows (Append-only, Time-Travel) can expose the physical version data to the index management system, providing more optional solutions for index management, while there are no physical version data rows in Delta storage, which is at a disadvantage in the above aspects.
advantage of garbage cleaning mechanisms constantly creating version data is that DBMS can use them to achieve “Time Travel” if it is not cleaned up, which means that you can access snapshots of data at any time. Postgre had done this in previous versions, but dropped support in the new version when they realized that there weren’t any scenarios that needed this functionality.
There are many disadvantages to the accumulation of version data, including huge storage overhead, extremely high version chain traversal overhead (also depending on the direction of the version chain and the use case). So naturally there needs to be a GC operation to free up this part of the space.
GC can be divided into 3 steps:
detect outdated versions
and disconnect them (remove linked list elements) in the linked list to
free up space
There are many ways to detect stale version data, such as detecting if a version row was created by an unusual transaction, or checking if a version row is not visible to all active transactions. For the latter, this can be achieved by comparing the
end-ts of the version row with the
TID of the active transaction. DBMS can usually store transaction information centrally, but on multi-core systems this can be a scalability bottleneck.
With detection methods in place, DMBS can look for these obsolete versions from different dimensions, such as starting from version data rows and starting from transactions.
paper introduces GC methods for two data row dimensions, VAC and COOP.
Background Vacuum (VAC) uses a background thread that periodically scans the database for outdated versions. The simplest implementation of this solution is to traverse the version of the linked list, but the GC performance of this is difficult to improve synchronously as the amount of data grows, and we need to reduce the purposeless traversal, or make the traversal range narrow. One optimization is for DBMS to maintain a bitmap that maps blocks containing changed data so that the background vacuum thread can skip blocks that have not changed since the last GC.
Fig 10: MVCC Tuple-level
Cooperative Cleaning (COOP) is based on using worker threads for GC instead. When looking for version data, the worker thread also needs to follow the version linked list to traverse, and if outdated version data is found during the traversal process, it will be cleaned up directly. There are two problems with this scheme:
There is a “dusty corners” problem, GC is associated with the query task, so if there are multiple versions of the logical data created, no queries have occurred, then these versions of the data have not been cleaned. Generally, DBMS will process this part of the data by combining VAC
: Transaction-level transaction dimension GC generally records the collection of data rows it reads and writes through transactions, and DBMS needs to monitor whether the version data
in the collection is outdated. When the version data collection of a transaction is not visible to all active transactions, the version data in the collection can be GC cleaned.
Fig 11: MVCC Transaction-level
Summary Tuple-level VAC is the most commonly used version data cleaning scheme, which can improve its performance by increasing the number of GC
threads. However, compared to Transaction-level GC, this scheme will have a greater impact on performance when there is a long transaction, because a long transaction means that all version data after it starts is not cleaned, and the version data chain will become very long until the long transaction is committed or aborted.
DBMSs that support MVCC store version data and index data separately. We can think of the index as a KV key-value pair, the key is the field value (such as ID) in the data row being indexed, and the value is the pointer to the corresponding data row of the index.
The case of primary key
indexes is relatively simple, because the primary key (which is also the key of the index) remains unchanged, and the value of the index always points to the beginning of the version data chain, such as in InnoDB, the data row of the primary key index points to the primary table. In a primary key index, what happens to the key-value pair pointer of the index depends on which data storage method is used by the DBMS.
For Delta storage, as we discussed above, the master version data is always stored in the main table, it is updated in place, so when the data is updated, the data row position in the main table does not change, so the pointer to the index Value does not change.
For Append-only storage, the version data chain has two different directions:
needs to be adjusted, at this time DBMS will generally perform a DELETE & INSERT operation on the index to complete the adjustment
will the pointer address N2O be adjusted, whenever a new version is generated, the pointer of the index value
For the auxiliary index,
the field value (which is also the key in the index) may change, The version data pointed to by the Index Value may also change. So there are different scenarios for managing pointers in an index.
most common solution for logical pointers is to create a layer of intermediate tables so that the value of the index can remain unchanged, such as pointing to a primary key. This scheme is also used by InnoDB, so we usually say that the secondary index will contain the indexed value as well as the primary key value. The pointer in the index is fixed by the primary key value, so that whenever the starting point of the version data linked list changes, only the pointer corresponding to the primary key value needs to be updated at the same time. Although when there is only one secondary index, it sounds like the complexity of the change is the same, all changing 1 pointer, but when there are many secondary indexes, it will be O(1) vs. O(n) difference.
The disadvantage of using primary keys is that the volume of the secondary index changes with the volume of the primary key, and another option is to provide a 64-bit unique ID for logical tuple. In fact, there is no difference in the idea, it is to add a layer of mapping between the physical version chain and the index, just to see how the mapped content picks a unique fixed, space-saving value.
once published an article called “Why Uber Engineering Switched from Postgres to MySQL”, but they didn’t actually use Postgres from the beginning. Uber also started using MySQL, hiring some engineers who were passionate about Postgres, so they switched from MySQL to Postgres. They added a lot of secondary indexes to the table, and found that Postgres’ secondary index pointed to the beginning of the version chain on disk, and when the beginning of the version chain changed, the pointers of multiple secondary indexes were modified. In the Append-only storage mode, the advantage of this design is that there is no return table query through the middle layer mapping (such as primary key ID), and the disadvantage is also very obvious, when there are many auxiliary indexes, the change at the beginning of the version data link will cause the pointers of all secondary indexes to need to be updated.
There are also DBMS that use this solution, such as MemSQL and Hekaton. If Uber’s engineers had read the paper, they might have saved a lot of money on migration costs.
, different management methods also have their own suitable scenarios. For Logical Pointer, it performs better in the scenario of frequent writes, because the secondary index only needs to change when the value of the index field changes, and in the read scenario, it also needs to compare different version chains, because the same logical value may correspond to different physical values, such as DELETE and then INSERT the same value; For Physical Pointers, the disadvantage has been mentioned earlier, and frequent update scenarios can lead to poor write performance, but they have their advantages when retrieving.
In addition, in the DBMS that supports MVCC, the so-called “overlay index” cannot actually get the query result by scanning a single index, because there is no version data in the index. For Append-only designs, going back to the main table for retrieval is a must; For Delta storage, you also need to find the corresponding version in the Delta space (such as undo log).
paper categorizes and summarizes the four main points of MVCC
concurrency control protocols: MVTO, MVOCC, MV2PL and Serialization Certifier
Version data storage: Append-only, Time-Travel, and Delta
garbage cleanup mechanisms: Tuple-level and transaction-level
index management: Logical pointers and Physical pointers
The intention is to show that MVCC does not have a set of standard implementations, and different implementations have different performances for specific scenarios of Workload. In Paper Reading, we do not show the test results of different approaches in the paper, and interested students can find the original text in the link at the end of the article.
In addition, although some common implementations can be summarized from different DBMSs, they each have further different optimizations, such as undo log management and GC in InnoDB are much more complex than the original overview.
 Y. Wu, An Empirical Evaluation of In-Memory Multi-Version Concurrency Control. In VLDB 2017.
 J. Böttcher, Scalable Garbage Collection for In-Memory MVCC Systems. In VLDB 2019.
 CMU 15-721 Advanced Database Systems.